Message from the Program Chair
Dear Reader: 


For twenty years, the annual LISA conference has been the foremost 
worldwide gathering for everyone interested in the technical and 
administrative issues of running a large computing facility. We are now 
being expected to manage installations of increasing size and complexity, 
with a higher degree of reliability and performance; I believe that this will 
only be possible through collaboration among different disciplines, from 
practising administrators with many years’ experience, to academics with
new insights and approaches. LISA is unique in bringing together this wide 
range of people to address real-world problems. 


These Proceedings contain the complete text of the refereed papers 
presented at the 2007 LISA conference. The 22 papers were selected from a 
total of 55 submissions, and I would hope that practising system 
administrators will find most of these to be relevant and accessible — the 
papers include “hot” new topics, such as the management of virtual
machines, and new perspectives on old problems such as spam and firewall 
configuration. They range from “soft” topics such as system administrator 
education, to more technical issues such as configuration management. 


Putting together the LISA conference has required the energy and 
experience of so many people, and I would like to thank all those who have 
been involved. I hope that you find the contents of these proceedings useful 
and inspiring, and that you may consider submitting your own work to a 
future LISA conference. 


Paul Anderson 
Program Chair 
LISA ’07

PolicyVis: Firewall Security Policy
Visualization and Inspection


Tung Tran, Ehab Al-Shaer, and Raouf Boutaba — University of Waterloo, Canada 


ABSTRACT 


Firewalls have an important role in network security. However, managing firewall policies is
an extremely complex task because the large number of interacting rules in single or distributed
firewalls significantly increases the possibility of policy misconfiguration and network
vulnerabilities. Moreover, due to the low-level representation of firewall rules, the semantics of
firewall policies are hard to comprehend, which makes inspecting a firewall policy’s properties a
difficult and error-prone task.


In this paper, we propose a tool called PolicyVis, which visualizes firewall rules and policies
in a way that makes firewall policies easier to understand and inspect. Unlike
previous work that attempts to validate or inspect firewall rules based on specific queries or errors,
our approach is to visualize firewall policies so that the user can pose general, unrestricted
inquiries such as “does my policy do what I intend it to do?” We describe the design principles of
PolicyVis and provide concepts and examples dealing with firewall policy properties, rule
anomalies and distributed firewalls. As a result, PolicyVis considerably simplifies the management
of firewall policies and hence effectively improves network security.


Introduction 


Network security is essential to the development
of the Internet and has attracted much attention in the research
and industrial communities. With the increase of network
attack threats, firewalls are considered effective
network barriers and have become important elements
not only in enterprise networks but also in small-size
and home networks. A firewall is a program or a hardware
device that protects a network or a computer system
by filtering out unwanted network traffic. The filtering
decision is based on a set of ordered filtering rules written
according to predefined security policy requirements.


Firewalls can be deployed to secure one network
from another. However, firewalls can be significantly
ineffective in protecting networks if policies are not
managed correctly and efficiently. It is crucial to
have policy management techniques and tools that
users can use to examine, refine and verify the correctness
of written firewall filtering rules in order to
increase the effectiveness of firewall security.


Humans are well adapted to capturing the essence and
patterns of data when it is presented in a visually
appealing way. This motivates visualizing data whose
analysis is otherwise hard or ineffective to carry out
because of its huge volume and complexity. The amount of
data that can be processed and analyzed has never been
greater, and continues to grow rapidly. As the number of
filtering rules grows and the policy becomes much more
complex, firewall policy visualization becomes an
indispensable aid to policy management. Firewall policy
visualization helps users understand their policies easily
and grasp complicated rule patterns and behaviors efficiently.


In this paper, we present PolicyVis, a tool for
visualizing firewall policies. We describe design
principles, implementation and application examples
that deal with discovering firewall policy properties and
rule anomalies for single or distributed firewalls.


Although network security visualization has been 
given strong attention in the research community, the 
emphasis was mostly on the network traffic [4, 8]. On 
the other hand, tools in [12, 16] visualize some fire- 
wall aspects, but don’t give users a thorough look at 
firewall policies. 


This paper is organized as follows. In the next
section, we summarize related work. We then describe
the PolicyVis design principles, followed by descriptions
of scenarios that show the usefulness of PolicyVis.
Next, we show how rule anomalies are visualized by
PolicyVis and demonstrate some examples of determining
rule anomalies by using PolicyVis. We then
describe visualization of distributed firewalls in PolicyVis,
followed by a discussion of the implementation
and evaluation of PolicyVis. Finally, we present conclusions
and plans for future work.


Related Work 


A significant amount of work has been reported
in the area of firewall and policy-based security management.
In this section, we focus our study on the
work that is closely related to PolicyVis’ design objectives:
network security visualization and policy management.


Many visualization tools have been introduced to
enhance network security. PortVis [15] uses port-based
detection of security activities and visualizes network
traffic by choosing important features for axes and displaying
network activities on the graph. SeeNet [3] supports
three static network displays: two of these use
geographical relationships, while the third is a matrix
arrangement that gives equal emphasis to all network
links. NVisionIP [11] uses a graphical representation of
a class-B network to allow users to quickly visualize
the current state of networks. Le Malécot, et al. [14]
introduced an original visualization design which combines
2-D and 3-D visualization for network traffic
monitoring. However, these tools focus only on visualizing
network traffic to assist users in understanding
network events and taking appropriate actions.


Moreover, previous work on firewall visualization
concentrates only on visualizing how firewalls
react to network traffic based on network events. Chris
P. Lee, et al. [12] proposed a tool that visualizes firewall
reactions to network traffic to aid users in configuring
firewalls. FireViz [16] visually displays the activities
of a personal firewall in real time to find
potential loopholes in the firewall’s security policies.
These tools can detect only a small subset of all
firewall behaviors and, unlike PolicyVis, cannot determine
all potential firewall patterns by looking at the policy
directly. Tufin SecureTrack [19] is a commercial firewall
operations management solution; however, it provides only
change management and version control for firewall policy
updates. It basically visualizes firewall policy version
changes, but not rule properties and relations, and allows
users to receive alerts if any change violates the
organizational policy. Thus, Tufin SecureTrack cannot be
used for rule analysis and anomaly discovery.


In the field of firewall policy management, a filtering
policy translation tool proposed in [2] describes,
in a natural textual language, the meaning and the
interactions of all filtering rules in the policy, revealing
the complete semantics of the policy in a very concise
fashion. However, this tool is not as efficient as
PolicyVis in helping users capture the policy properties
quickly when the policy contains a huge number of rules.
In [1], the authors described firewall policy
anomalies and techniques to discover them, and suggested
a tool called Firewall Policy Advisor which
implements anomaly discovery algorithms. However,
Firewall Policy Advisor is not capable of showing all
potential behaviors of firewall policies and does not
help users tell whether a policy does what they want.


The authors in [6, 10] suggest methods for detecting
and resolving conflicts among general packet
filters. However, only the correlation anomaly [1] is
considered because it causes ambiguity in packet classifiers.
In addition, the authors in [13, 18] proposed firewall
analysis tools that allow users to issue customized
queries on a set of filtering rules and display
corresponding outputs in the policy. However, the
query reply could be overwhelming and still complex to
understand. PolicyVis output is more comprehensible.
Moreover, those tools require users to consider very specific
issues to inspect the policy. PolicyVis, on the other
hand, enables users to investigate the whole policy at once,
which is more practical and efficient for large policies.


PolicyVis Objectives


Information visualization is an effective
way to capture and explore large amounts of abstract
data. To guarantee correct firewall behavior,
users need to recognize and fix firewall
misconfigurations swiftly. However, dealing with
firewall policies is complex because of the large number
of rules, rule complexity and rule dependency.
These facts motivate a tool
which visualizes all firewall rules in such a way that
rule interactions are easily grasped and analyzed in
order to come up with a timely solution to any
firewall security breach.


PolicyVis is a visualization tool for firewall policies
that helps users achieve the following goals
effectively and quickly:

• Visualizing rule conditions, address space and
  action: a firewall policy is characterized by rule
  format, rule dependency and matching semantics.
  Comprehensive visualization of firewall policies
  requires a mechanism for transforming firewall
  rules into visual elements which significantly
  enhance the investigation of policies. PolicyVis
  visualizes all core elements of a firewall rule:
  conditions, address space and action.
• Firewall policy semantic discovery: users commonly
  need to know all possible behaviors of a firewall
  toward the system it is intended to protect. With the
  advantages of visualization and the many graphic
  options supported by PolicyVis, all potential firewall
  behaviors, which are normally very hard to grasp in
  a textual context, can be easily discovered.
• Firewall policy rule conflict discovery: PolicyVis
  not only gives users a view of normal rule
  interactions, but also pinpoints all possible rule
  anomalies in the policy. This is a crucial feature
  that makes PolicyVis a very helpful tool for users.
  All kinds of rule conflicts can be efficiently
  visualized without running any algorithm to find
  potential rule conflicts.
• Firewall policy inspection based on users’
  intention: with a policy of thousands of rules, it is
  very likely that the user will make configuration
  mistakes (other than the rule conflicts mentioned
  above) in the policy which cause the firewall to
  function incorrectly. PolicyVis brings all firewall
  rules into a graphic view so that configuration
  mistakes are easy to spot.
• Visualizing distributed firewalls: the security of
  distributed firewalls is as important as that of a
  single firewall. Besides visualizing a single firewall,
  PolicyVis also lets users visualize distributed
  firewalls with the same efficiency in all the goals
  mentioned above.


PolicyVis Design Principles 


The fundamental design requirements for PolicyVis
included:

• Simplicity: It should be fairly intuitive for users
  to inspect firewall policies in a 2D graph using
  multiple fields. We chose to compress firewall
  rules into a 2D graph with three fields because it
  is very likely that a certain field (like source
  port) can be ignored or is not important when
  investigating the policy. A 2D graph is simple but
  quite effective in helping users look thoroughly
  into the policy’s behaviors.
• Expressiveness: It is very important that users
  can easily capture true rule interactions so that
  appropriate actions can be taken immediately.
  PolicyVis supports very detailed and thorough
  visualization of all possible behaviors of firewall
  rules by considering all rule fields, rule
  order and all rule actions.
• Flexible visualization scope: PolicyVis allows
  users to visualize what they are interested in
  (the target: any rule field) so that all possible
  aspects of the policy can be viewed and analyzed.
  Moreover, with support for multiple dimensions,
  PolicyVis lets users choose the desired fields for
  the graph coordinates, which makes it convenient
  and effective to observe and investigate the policy
  from different views. In addition, the type of
  traffic (accepted, denied or both) can be viewed
  separately to meet users’ different purposes.
• Ability to compress, focus and zoom: Users often
  need to take a closer look at a specific
  set of rules when investigating the policy.
  PolicyVis supports zooming so that users can closely
  investigate a chosen set of rules. This zooming
  feature is very useful if too many rules are
  involved in the investigation and the axes get
  crowded. In addition, PolicyVis gives users the
  ability to investigate rule anomalies existing in
  the policy through the focusing feature. With
  PolicyVis, users can also visualize the whole
  policy at once as well as portions of the policy
  partitioned by ranges of a specific field. This is
  helpful when users want to consider the policy
  applied to a subnet or a desired portion of the
  network.
• Ability to use policy segmentation: In order to
  investigate accepted or denied traffic only,
  PolicyVis employs policy segmentation with the BDD
  technique [5], a powerful means of increasing the
  effectiveness and correctness of extracting useful
  information from the policy.
• Ability to use symbols, colors and notations:
  Policies are characterized by rule format, rule
  dependency and matching semantics (rule order).
  Moreover, firewall rules contain conditions
  (protocol, port and address), values (specific and
  wildcard) and actions (allowed and denied).
  PolicyVis visualizes those features using colors,
  symbols and notations, which are essential for
  users to capture quickly and easily the internal
  interactions and behavior of firewall policies.


Multi-level Visualizing of Firewall Policies 


Using PolicyVis, multi-level visualization of firewall
policies can be accomplished effectively. With
PolicyVis’ many flexible features, users can inspect the
firewall policy from different views (like port level,
address level, etc.) to understand all potential internal
behaviors of the policy. In order to achieve this goal,
PolicyVis deploys many methods and techniques which
efficiently bring firewall policies to expressive visual
views.


Using BDDs To Segment Policy and Find Accepted 
and Denied Spaces 


Firewall policy segmentation using Binary Deci- 
sion Diagram or BDD was first introduced by our 
group in [5, 9] to enhance the firewall validation and 
testing procedures. As defined in [5], a segment is a 
subset of the total traffic address space that belongs to 
one or more rules, such that each of the member ele- 
ments (i.e., header tuples) conforms to exactly the 
same set of policy rules. Rules and address spaces are 
represented as Boolean expressions and BDD is used 
as a standard canonical method to represent Boolean 
expressions. By taking advantage of the properties of BDDs,
firewall rules are effectively segmented into disjoint
segments, each of which belongs to either the accepted
or the denied space.


R1: tcp  121.63.*.*: any   143.91.78.*: any  accept
R2: tcp  121.63.71.*: any  143.91.*.*: any   accept
R3: any  *.*.*.*: any      *.*.*.*: any      deny

[Figure panels: (a) Policy address-space; (b) Segmented address-space]
Table 1: Example of firewall policy segmentation.


Specifically, the authors in [9] suggest constructing
a Boolean expression for a policy $P_a$ using the rule
constraints as follows:

$$P_a = \bigvee_{i \in index(a)} \left( \neg C_1 \wedge \neg C_2 \wedge \cdots \wedge \neg C_{i-1} \wedge C_i \right)$$

where $index(a)$ is the set of indices of rules that have $a$
as their action and $C_i$ is the rule condition of conjunctive
fields. In other words,

$$index(a) = \{\, i \mid R_i = C_i \rightarrow a \,\}$$

This formula can be understood as saying that a
packet will trigger action $a$ if it satisfies the condition
$C_i$ for some rule $R_i$ with action $a$, provided that the
packet does not match the condition of any prior rule
in the policy. Table 1 shows an example of a policy of
three intersecting rules forming a total of four independent
segments of policy address space.
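
The segmentation and the first-match formula can be illustrated on the rules of Table 1 with a minimal sketch. It does not use BDDs: it simply enumerates a small, hypothetical sample of packets, which is an assumption made purely for illustration and would not scale to the full address space. The helper names (`ip_matches`, `matches`, `segments`, `action_space`) and the sample addresses are likewise illustrative and do not come from PolicyVis.

```python
from itertools import product

# Rules of Table 1 as (protocol, source pattern, destination pattern, action).
# Ports are "any" in all three rules, so they are omitted in this sketch.
RULES = [
    ("tcp", "121.63.*.*", "143.91.78.*", "accept"),
    ("tcp", "121.63.71.*", "143.91.*.*", "accept"),
    ("any", "*.*.*.*", "*.*.*.*", "deny"),
]


def ip_matches(pattern, address):
    """True if a dotted address matches a dotted pattern with '*' wildcards."""
    return all(p in ("*", a) for p, a in zip(pattern.split("."), address.split(".")))


def matches(rule, packet):
    """True if the packet satisfies the rule condition C_i."""
    proto, src, dst, _action = rule
    p_proto, p_src, p_dst = packet
    return proto in ("any", p_proto) and ip_matches(src, p_src) and ip_matches(dst, p_dst)


def segments(rules, packets):
    """Group packets by the exact set of rules they conform to."""
    groups = {}
    for packet in packets:
        key = frozenset(i for i, rule in enumerate(rules, 1) if matches(rule, packet))
        groups.setdefault(key, set()).add(packet)
    return groups


def action_space(rules, packets, action):
    """P_a: packets matching C_i of a rule with this action and no earlier C_j."""
    space = set()
    for i, rule in enumerate(rules):
        if rule[3] != action:
            continue
        for packet in packets:
            if matches(rule, packet) and not any(matches(r, packet) for r in rules[:i]):
                space.add(packet)
    return space


# Hypothetical sample packets (protocol, source, destination), one per region.
PACKETS = set(product(
    ["tcp", "udp"],
    ["121.63.71.5", "121.63.9.5", "10.0.0.1"],
    ["143.91.78.2", "143.91.5.2", "10.0.0.2"],
))

segs = segments(RULES, PACKETS)
accepted = action_space(RULES, PACKETS, "accept")
denied = action_space(RULES, PACKETS, "deny")
print(sorted(sorted(key) for key in segs))
```

On this sample the rule sets that define the segments are {R3}, {R1, R3}, {R2, R3} and {R1, R2, R3}, which agrees with the four segments mentioned for Table 1, and the accepted and denied spaces are disjoint and together cover every sampled packet.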


PolicyVis allows users to visualize only accepted
or denied traffic; therefore it is important to efficiently
extract those spaces from the policy. A naive algorithm
to achieve this might take exponential running time.
Fortunately, policy segmentation using BDDs is quite
effective in doing this. We decided to employ BDDs for
segmenting rules to quickly retrieve correct accepted
and denied spaces. This makes PolicyVis a reliable and
fast tool. Policy rules are segmented using BDDs right
after they are read from the input file. This ahead-of-time
rule segmentation speeds up the process when the
user chooses to visualize only accepted or denied traffic.


Firewall Visualization Techniques 


In this section, we describe visualization tech- 
niques and methods used in our PolicyVis tool to 
achieve the objectives. More specific techniques and 
algorithms to visualize firewall anomalies are de- 
scribed later. 


To achieve effective visualization, PolicyVis
supports visualization of both policy segments and policy
rules; which one is used depends on the properties of the
policy that users want to examine. When dealing with only
the accepted or the denied space, PolicyVis visualizes policy
segments obtained from using BDDs as mentioned earlier.
However, when users choose to investigate both
accepted and denied spaces together, PolicyVis visualizes
policy rules because the union of both spaces
gives back the original rules. Moreover, visualizing
policy rules in this case helps users capture all possible
rule interactions, which are hard to see by
looking at separate visualizations of the two spaces.


When users investigate a firewall policy scope (a
field and a value), PolicyVis collects all rules (or segments)
whose corresponding field is a superset
of the scope input and visualizes those rules (or segments).
When choosing a scope to investigate, users
want to inspect how the firewall policy applies to that
scope, so only rules (or segments) that include the
address space of the target scope are shown. Rules (or segments)
are represented as rectangles with different colors to
illustrate different kinds of traffic (accepted or denied).
Those colors are set transparent so that overlapping rules
with the same or different actions can be effectively
recognized. Moreover, different symbols (small
square and circle) placed at the corner of rectangles are
used for different traffic protocols (e.g., TCP, UDP,
ICMP, IGMP) and notations (i.e., tooltips or legends)
are used to indicate rule order and related information.
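
The scope selection step can be sketched as a simple superset filter. This is a minimal illustration under stated assumptions: field values are either `any` or dot-separated strings with `*` wildcards, and the rule dictionaries, field names and helper functions (`covers`, `rules_in_scope`) are hypothetical rather than taken from the PolicyVis implementation.

```python
def covers(rule_value, scope_value):
    """Return True if rule_value is a superset of scope_value.

    Values are either 'any' or dot-separated strings in which '*' is a
    wildcard component (for example '161.120.33.*' or a port such as '22').
    """
    if rule_value == "any":
        return True
    rule_parts = rule_value.split(".")
    scope_parts = scope_value.split(".")
    return len(rule_parts) == len(scope_parts) and all(
        r in ("*", s) for r, s in zip(rule_parts, scope_parts)
    )


def rules_in_scope(rules, field, value):
    """Collect the rules whose `field` is a superset of the scope value."""
    return [rule for rule in rules if covers(rule[field], value)]


# Hypothetical policy; the rules below are illustrative only.
POLICY = [
    {"order": 1, "src_ip": "140.192.37.*", "dst_ip": "161.120.33.*",
     "dst_port": "any", "action": "accept"},
    {"order": 2, "src_ip": "*.*.*.*", "dst_ip": "161.120.33.44",
     "dst_port": "1433", "action": "deny"},
    {"order": 3, "src_ip": "*.*.*.*", "dst_ip": "161.120.34.*",
     "dst_port": "80", "action": "accept"},
]

in_scope = rules_in_scope(POLICY, "dst_ip", "161.120.33.44")
print([rule["order"] for rule in in_scope])
```

With the scope Destination Address = 161.120.33.44, rules 1 and 2 are selected because their destination fields contain that address, while rule 3 is left out.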

When multiple rectangles (rules or segments) are
sketched from the same coordinates, colors and symbols
might not be enough to tell what kind of traffic or
protocol a rectangle belongs to. Additional notations
are used to clearly indicate those properties. Round
brackets are used to tell if a rectangle represents denied
traffic, otherwise it represents allowed traffic. Curly
brackets are used to denote the UDP protocol, otherwise it
is the TCP protocol.


Figure 1: A single firewall policy.


PolicyVis uses three different rule fields to build 
the policy graph, two of which are used as the graph’s 
vertical and horizontal coordinates and the third field 
is integrated into the visualization objects (e.g., at the 
corner of rectangles) avoiding 3D graphs for simplic- 
ity. In general, by default, PolicyVis chooses the in- 
vestigated scope as one of the coordinates (axes), and 
from three remaining fields, the least common field 
will be the other coordinate and the second least com- 
mon field will be the last dimension. 


In addition, PolicyVis places rule field values along the
x-axis and y-axis in such a way that the aggregated values
(e.g., wildcards) precede the discrete values on the
axis, that is, they lie closer to the origin of the graph.
Moreover, the width, the length and the position of a
rectangle are chosen based on its corresponding rule’s
attributes so that an aggregated rule or segment
(represented by a rectangle) contains its subset ones in the
graph and disjoint segments or rules are represented by
non-overlapping rectangles (there are no adjacent rectangles).


Case Studies 


In this section, we created application scenarios 
to explore the potential of PolicyVis to help users find 
the policy misconfiguration problems. All the scenar- 
ios were created based on the single firewall policy 
shown in Figure 1. 


[Figure: policy graph plotting Source IP against Destination IP; legend: Allowed traffic, Denied traffic, Both]


Scenario 1: The admin receives an email from
the SSH server development team mentioning that
there currently exists an SSH server zero-day exploit in
the wild. He wants to investigate the firewall policy
for accepted traffic to port 22. The admin performs
this investigation by choosing the target (scope):
Destination Port with 22 as the input and viewing allowed
traffic only as shown in Figure 2.


Observation: policy segments that allow traffic 
to SSH (port 22) are extracted and visualized by Poli- 
cyVis as shown in Figure 2. Thus, the admin can then 
decide to block this traffic temporarily. 


Scenario 2: The University’s student database is
stolen and the database server with IP address
161.120.33.44 (possibly compromised) is suspected of not
being protected well by the firewall. The admin wants to
investigate the firewall policy applied to this server.
He looks into the traffic allowed and blocked by the
firewall for this IP address by choosing the target
(scope): Destination Address with 161.120.33.44 as
the input as shown in Figure 3.


Observation: denied and allowed traffic to port
1433 (MSSQL server) controlled by the firewall is
almost what the admin expected, except for the traffic
from source address 140.192.37.2 (from rule number
1), which should not be allowed. The problem is that traffic
allowed to 161.120.33.* by rule 1 is also allowed to
161.120.33.44. Thus, the admin might remove Rule 1 from
the firewall or change it.


Scenario 3: The University’s whole network is
down because of a denial of service attack. The admin
suspects that this attack is from a specific region in a
country with network IP addresses starting with 141.*.*.*,
aiming at many services including telnet, web, ftp,
etc. He needs to revise the firewall policy for any traffic
from any IP address starting with 141.*.*.*. The
admin chooses the target (scope): Source Address with
141.*.*.* as the input and selects Destination Port
(corresponding to the University’s network services) as
one of the graph dimensions as shown in Figure 4.


Figure 2: Allowed traffic to port 22.


Observation: the firewall policy currently blocks
traffic to the telnet service (port 23) and the web service
(port 80) from some IP addresses starting with 141; however,
the SMTP service (port 25) and the FTP service (port 21) are
accessible from most IP addresses starting with 141
and hence vulnerable to the attack. Thus, the admin
may set firewall rules to block traffic from some or all
addresses starting with 141 to those services as well.


Scenario 4: The University maintains two replicated
TFTP servers (port 69) with IP addresses
161.122.33.43 and 161.122.33.44 to satisfy students’ high
demand for downloading video lectures and also to increase
the downloading speed. However, several students still
complain about low downloading speed and sometimes
they are blocked from downloading. The admin
first checks the two servers and sees that they both are
working well. He suspects that he might have made mistakes
when writing firewall rules for the two servers
so that one of them might not function as wanted. He
needs to check the firewall policy and expects that the
policy for both servers should be the same because
they are replicated and have the same mission. The
admin chooses the target (scope): Destination Port
with 69 as the input as shown in Figure 5.


Figure 3: Traffic blocked and allowed to 161.120.33.44.
Figure 4: Controlled traffic from 141.*.*.*.


Observation: traffic controlled by the firewall to
the two servers is not the same. The admin recognizes
that he made mistakes blocking traffic from 144.*.*.*
and 145.*.*.* to server 161.122.33.44 when it should
be allowed, as it is to server 161.122.33.43. Thus, the
admin corrects his mistakes by changing the actions in
the corresponding rules in the firewall.


Visualizing Rule Anomalies 
Definition 


In this section, we mention crucial definitions 
and concepts of firewall policy anomalies introduced 
in [1] so that readers can understand how PolicyVis 
visualizes rule anomalies described in the next section. 


A firewall policy conflict is defined as the exis- 
tence of two or more filtering rules that may match the 
same packet or the existence of a rule that can never 
match any packet on the network paths that cross the 
firewall, e.g.: 

• Shadowing anomaly: a rule is shadowed when
  a previous rule matches all the packets that
  match this rule, so that the shadowed rule will
  never be activated.
• Generalization anomaly: a rule is a generalization
  of a preceding rule if they have different
  actions and if the second rule can match all the
  packets that match the first rule.
• Redundancy anomaly: a redundant rule performs
  the same action on the same packets as
  another rule such that if the redundant rule is
  removed, the security policy will not be affected.
• Correlation anomaly: two rules are correlated
  if they have different filtering actions, and the
  first rule matches some packets that match the
  second rule and the second rule matches some
  packets that match the first rule.
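
These definitions can be turned into a small pairwise check. The sketch below is only an illustration under simplifying assumptions: each rule is modelled as a plain Python set of abstract packet identifiers plus an action, shadowing is reported only for rules with different actions (matching the different-color criterion used when recognizing anomalies in PolicyVis), and the redundancy test looks at two rules in isolation and ignores any rules located between them. The function name `classify_pair` is hypothetical.

```python
def classify_pair(earlier, later):
    """Classify the anomaly between two rules given in policy order.

    Each rule is a (packets, action) pair, where packets is the set of
    packets the rule matches. Returns None if no anomaly is detected.
    """
    earlier_packets, earlier_action = earlier
    later_packets, later_action = later

    if not (earlier_packets & later_packets):
        return None  # disjoint rules never conflict

    if earlier_action != later_action:
        if later_packets <= earlier_packets:
            return "shadowing"       # the later rule can never be activated
        if earlier_packets < later_packets:
            return "generalization"  # the later rule covers the earlier one
        return "correlation"         # partial overlap in both directions

    # Same action: containment makes one of the rules potentially redundant.
    if later_packets <= earlier_packets or earlier_packets <= later_packets:
        return "redundancy"
    return None


# Packets are abstract integers here, chosen only for illustration.
wide = set(range(10))
print(classify_pair((wide, "deny"), ({3, 4}, "accept")))
print(classify_pair(({3, 4}, "accept"), (wide, "deny")))
print(classify_pair(({1, 2, 3}, "accept"), ({3, 4}, "deny")))
print(classify_pair(({1, 2, 3}, "accept"), ({2, 3}, "accept")))
```

Rule order matters only for telling shadowing from generalization, which is why the function takes the two rules in the order in which they appear in the policy.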


Rule Anomaly Visualization Methodology and
Algorithm

As the number of firewall rules increases, it is
very likely that an anomaly will exist in the policy
which threatens the firewall’s security. Anomaly discovery
is necessary in order to ensure the firewall’s
correctness. Firewall Policy Advisor [1] is the first
tool to discover anomalies in a firewall policy. However,
it is not as expressive as PolicyVis in anomaly
discovery and does not give users a visual view of how
an anomaly occurs.


The four classes of firewall policy anomalies mentioned
previously are visualized by PolicyVis. These
anomalies are easily pinpointed by overlapping areas
on the graph because an overlapping area represents
rules with overlapping traffic, which can potentially
cause firewall policy anomalies. Each of the anomalies
has specific features that are easily recognized on the
PolicyVis graph because its corresponding overlapping
area is formed (or looks) different in terms of rectangle
position, colors and notations. These features are
different for all four anomalies.


Figure 5: Firewall policy for destination port 69.


As PolicyVis visualizes rules in a 2D graph which
shows users only three fields, an overlapping
traffic area indicates a potential anomaly;
however, it does not always mean that the corresponding
rules really overlap because their
fourth field might be different. Nonetheless, PolicyVis
still lets users visualize real anomalies by allowing
related rules to be investigated more closely. When the
user wants to investigate an overlapping area, he simply
clicks on it and PolicyVis will focus on more
details of the related rules.


PolicyVis first collects all rules containing the
selected area, and then sketches a different graph for
these rules. In order to correctly view real anomalies
with only three fields used on the graph, PolicyVis
needs to choose a left-out field which is the same for
all the related rules. This common field is guaranteed
to exist because related rules from an overlapping area
must have at least two fields in common. PolicyVis
selects the most common and least important field to
be the left-out one if there are multiple common fields
among the related rules.


Moreover, among the three fields used for the focusing
graph, PolicyVis picks the most common field over
the related rules to be the third coordinate (the one
integrated into visualization objects), and chooses the other
two fields as the graph’s normal coordinates (used for
axes). This coordinate selection technique assures users
that, from this focusing view, an overlapping area
definitely indicates at least one anomaly in the policy.


Figure 6: Diagram to determine possible anomalies.
Figure 7: An example of a firewall policy.


To find the most common field over some firewall
rules, for each rule field excluding the Action
field, PolicyVis needs to find a rule’s field value which
is a subset of all other rules’ field values, and compute
the number of rules that have the field value equal to
that rule’s field value. The field that has the biggest
number is the most common field over the rules. The
algorithm FindMostCommonField to find the most
common field is implemented as shown in Table 2.


How To Recognize Anomalies In PolicyVis 


The order of the rules is an important factor in understanding
the policy semantics and in determining the type of a firewall
anomaly, especially when distinguishing shadowing from
generalization anomalies. Besides allowing users to see the
rule order by moving the mouse over an overlapping area,
PolicyVis draws surrounding rectangles (not color-filled)
around the overlapping rule rectangles, only in the focusing
view, to visualize the order of the rules (or rectangles) in
each overlapping area. The difference in width (and height)
between a rule rectangle and its surrounding rectangle in an
overlapping area is called the border, and it shows the rule
order: the rule (or rectangle) with the bigger border comes
first in the policy. This technique offers an easy way to
determine the type of the anomaly visually, without any
manual investigation.


Shadowing and generalization anomalies: These two anomalies
can be recognized by a rectangle that is totally contained in
another rectangle of a different color (a different filtering
action). The rule order (based on the surrounding rectangles)
decides which of the two anomalies the overlapping area
belongs to.


Redundancy anomaly: The features used to recognize this
anomaly are almost the same as those used to pinpoint
shadowing and generalization anomalies. Instead of having
different colors, the overlapping rectangles have the same
color (the same filtering action), and no rectangle of a
different color appears between them.


Correlation anomaly: This anomaly corresponds to two
rectangles with different colors that partially overlap each
other.


If two rectangles do not overlap, there is no anomaly between
the two rules represented by those rectangles. With the help
of PolicyVis, it is straightforward to pinpoint all anomalies
that might exist in the firewall policy. Figure 6 summarizes
the method for determining the different rule anomalies, which
is very effective in a visualized environment like PolicyVis.


Algorithm FindMostCommonField
Input: rules
Output: the most common field among the input rules

for each field in rule.fields \ {action}
    if field = dest_ip or field = src_ip
        C_field = *.*.*.*
    end if
    if field = dest_port or field = src_port
        C_field = *
    end if
    for each rule R_i in rules    { find a field value which is a subset of all other field values }
        C_field = C_field ∩ R_i.field
    end for
    N_field = 0
    for each rule R_i in rules    { count the number of rules having field value equal to the common subset value }
        if R_i.field = C_field
            N_field = N_field + 1
        end if
    end for
end for
N = max(N_dest_ip, N_src_ip, N_dest_port, N_src_port)    { choose the most common field }
if N = N_src_port
    return src_port
end if
if N = N_dest_port
    return dest_port
end if
if N = N_src_ip
    return src_ip
end if
if N = N_dest_ip
    return dest_ip
end if

Table 2: Algorithm to find the most common field.
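
As an illustration, the procedure in Table 2 can be sketched in Python. This is a minimal sketch under stated assumptions, not the PolicyVis implementation: field values are modeled as dotted wildcard patterns (an address such as 161.120.33.*, a port given as * or a single number), and the helper intersect, the dictionary layout of a rule, and the sample rules are all hypothetical.

```python
# Sketch of FindMostCommonField (Table 2). Assumption: a field value is a
# dotted pattern in which "*" matches any component, e.g. "161.120.33.*"
# for an address, and "*" or a single number such as "80" for a port.

# Fields in the tie-breaking order used by the return statements of Table 2.
FIELDS = ("src_port", "dest_port", "src_ip", "dest_ip")


def intersect(a, b):
    """Intersect two patterns component-wise; None means the empty set."""
    if a is None or b is None:
        return None
    parts = []
    for x, y in zip(a.split("."), b.split(".")):
        if x == "*":
            parts.append(y)
        elif y == "*" or x == y:
            parts.append(x)
        else:
            return None  # the two patterns share no value
    return ".".join(parts)


def find_most_common_field(rules):
    """Return the field whose common subset value is shared by most rules."""
    counts = {}
    for field in FIELDS:
        # Start from the wildcard and narrow it with every rule's value.
        common = "*.*.*.*" if field.endswith("_ip") else "*"
        for rule in rules:
            common = intersect(common, rule[field])
        # Count the rules whose value equals the common subset value.
        counts[field] = sum(1 for rule in rules if rule[field] == common)
    best = max(counts.values())
    return next(field for field in FIELDS if counts[field] == best)


# Hypothetical rules, for illustration only.
rules = [
    {"src_ip": "140.192.37.*", "src_port": "1024",
     "dest_ip": "161.120.33.41", "dest_port": "80"},
    {"src_ip": "140.192.37.5", "src_port": "*",
     "dest_ip": "161.120.33.41", "dest_port": "21"},
]
print(find_most_common_field(rules))
```

For the two sample rules the sketch selects dest_ip, because the destination address is the only field whose common subset value is held by both rules.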

A Case Study

When PolicyVis is used to investigate the firewall policy
shown in Figure 7, the firewall rules are visualized as
shown in Figure 8. The admin sees many overlapping areas
which might contain rule anomalies. There are five suspected
overlapping areas (numbered on the graph) which the admin
believes contain rule anomalies. From this view alone, he
suspects the following:

1. a potential shadowing anomaly
2. a potential generalization anomaly
3. a potential correlation anomaly
4. a potential redundancy anomaly
5. a potential generalization anomaly


However, to make sure that these are real anomalies, the
admin needs to investigate each overlapping area closely.
To do this, the admin simply clicks on each overlapping
area, and PolicyVis focuses on it and shows a more detailed
view of that area.


Shadowing anomaly visualization: When the admin clicks on
overlapping area number 1 (Figure 8), he is brought to the
view where all traffic has the same Destination IP address
161.120.33.41, as shown in Figure 9. From this view, it is
clear that there is a shadowing anomaly between rule 3 and
rule 4 (rule 4 is shadowed by rule 3) because the rectangle
representing rule 4 is totally contained in the rectangle
representing rule 3 and they have different colors. The
“Rule 3” and “Rule 4” tooltips appear in this case because
the admin moves the mouse over the overlapping area. Even
without these tooltips, the admin can still tell that this
is a shadowing anomaly because he knows, from the
surrounding rectangles, that the outer rectangle comes
first in the policy.

Figure 8: Many potential anomalies in the policy.


Generalization anomaly visualization: When the admin clicks
on overlapping area number 2 (Figure 8), he is brought to
the view where all traffic has the same Destination IP
address 161.120.33.43, as shown in Figure 10. From this
view, it is clear that there is a generalization anomaly
between rule 5 and rule 6 (rule 6 is a generalization of
rule 5) because the rectangle representing rule 5 is totally
contained in the rectangle representing rule 6 and they have
different colors. Moreover, even without the tooltips
(“Rule 5” and “Rule 6”), the admin can still tell from the
surrounding rectangles that the inner rectangle comes first
in the policy, and hence that this is a generalization
anomaly.

Correlation anomaly visualization: When the admin clicks on
overlapping area number 3 (Figure 8), he is brought to the
view where all traffic has the same Destination Port 20, as
shown in Figure 11. From this view, it is clear that there
is a correlation anomaly between rule 1 and rule 2 because
the rectangle representing rule 1 partially overlaps the
rectangle representing rule 2 and they have different colors.


Redundancy anomaly visualization: When the admin clicks on
overlapping area number 4 (Figure 8), he is brought to the
view where all traffic has the same Destination IP address
161.120.33.43, as shown in Figure 12. From this view, it is
clear that there is a redundancy anomaly between rule 2 and
rule 13 (rule 13 is redundant to rule 2) because the
rectangle representing rule 13 is totally contained in the
rectangle representing rule 2 and they have the same color.


Overlap but no anomaly: When the admin clicks on overlapping
area number 5 (Figure 8), he is brought to the view where
all traffic has the same Destination IP address
161.120.33.45, as shown in Figure 13. From this view, it is
clear that there is no anomaly because the rectangles
representing the rules do not overlap. Rule 11 and Rule 12
overlap in Figure 8 because Rule 11’s Destination Address
and Source Address are subsets of Rule 12’s Destination
Address and Source Address respectively, and those two
fields, together with Destination Port, are the dimensions
chosen for the view in Figure 8. However, Rule 11 and
Rule 12 have different Source Ports, and PolicyVis
automatically chooses Source Port as one of the dimensions
for the new view shown in Figure 13.

Figure 10: Generalization anomaly between rule 5 and rule 6.
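
The visual cues used in this case study (containment, color, and rule order) can be expressed as a small pairwise classifier. The sketch below is an illustration under assumptions, not PolicyVis code: the RuleRect type, the integer coordinate ranges, and the sample rectangles are hypothetical, the check that no rectangle of a different color lies between two redundant rules is omitted, and a partial overlap of same-colored rectangles is reported as no anomaly because the text does not list an anomaly for that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleRect:
    order: int     # position of the rule in the policy (1 = first)
    x: tuple       # inclusive (low, high) range on the first axis
    y: tuple       # inclusive (low, high) range on the second axis
    action: str    # filtering action, i.e. the rectangle color


def _contains(outer, inner):
    return (outer.x[0] <= inner.x[0] and inner.x[1] <= outer.x[1]
            and outer.y[0] <= inner.y[0] and inner.y[1] <= outer.y[1])


def _overlaps(a, b):
    return (a.x[0] <= b.x[1] and b.x[0] <= a.x[1]
            and a.y[0] <= b.y[1] and b.y[0] <= a.y[1])


def classify(a, b):
    """Name the anomaly suggested by two rectangles in the focusing view."""
    if not _overlaps(a, b):
        return "none"
    same_action = a.action == b.action
    if _contains(a, b) or _contains(b, a):
        outer, inner = (a, b) if _contains(a, b) else (b, a)
        if same_action:
            return "redundancy"
        # Outer rule first: the inner rule is shadowed.
        # Inner rule first: the outer rule generalizes it.
        return "shadowing" if outer.order < inner.order else "generalization"
    return "none" if same_action else "correlation"


# Made-up coordinates; only the rule numbers follow the case study.
r3 = RuleRect(3, (0, 100), (0, 100), "deny")
r4 = RuleRect(4, (10, 20), (10, 20), "accept")
r5 = RuleRect(5, (10, 20), (10, 20), "deny")
r6 = RuleRect(6, (0, 100), (0, 100), "accept")
print(classify(r3, r4), classify(r5, r6))
```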


Visualizing Distributed Policy Configuration 
Concept 


While a single firewall is normally deployed to 
protect a single subnet or domain, distributed firewalls 
are essential for protecting the entire network. Any mis- 
configuration or conflict between distributed firewalls 
might cause serious flaws or damages to the network [2]. 


Anomalies exist not only within a single firewall but also
between firewalls, when any two firewalls on a network path
take different filtering actions on the same traffic.
Distributed firewalls are more likely to contain rule
anomalies than a single firewall because their management
is decentralized. It is possible that no single firewall in
the network contains a rule anomaly, yet there are still
anomalies between different firewalls.


Visualizing distributed firewalls gives the same benefits
as visualizing single firewalls: policy behavior discovery,
policy correctness checking, and anomaly finding.
Distributed firewalls are modeled as a tree whose root is
the borderline firewall, which directly filters traffic
into and out of the network. Each node in the tree
represents a single firewall, which can be placed between
subnets or domains in the network.


To get through a firewall, a packet from outside the
network needs to pass the filtering of every firewall on
the path from the root to the node representing that
firewall. In the distributed firewalls view, PolicyVis
creates a firewall tree based on the network topology input
files and lets the user pick a path (from the root to any
node) that he wants to examine. PolicyVis then builds a
rule set for that path by simply reading the rules from the
nodes in order, from the root to the last node. After that,
PolicyVis treats this rule set as that of a single firewall
and visualizes it as before.
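
The path-based construction described above can be sketched as follows. This is a minimal illustration, not the PolicyVis implementation: the FirewallNode type, the function names, and the placeholder rule names are assumptions, and the firewall names are borrowed from the case study in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class FirewallNode:
    name: str
    rules: list                                   # rules in priority order
    children: list = field(default_factory=list)  # firewalls behind this one


def find_path(node, target):
    """Return the nodes from `node` down to the firewall named `target`."""
    if node.name == target:
        return [node]
    for child in node.children:
        rest = find_path(child, target)
        if rest is not None:
            return [node] + rest
    return None


def path_rule_set(root, target):
    """Concatenate the rule lists of all firewalls on the root-to-target path."""
    path = find_path(root, target)
    if path is None:
        raise ValueError(f"no firewall named {target!r} in the tree")
    # Rules are read in order from the root to the last node on the path.
    return [(node.name, rule) for node in path for rule in node.rules]


# Placeholder rule names, for illustration only.
lab = FirewallNode("Network Lab", ["lab-1", "lab-2"])
cs = FirewallNode("CS department", ["cs-1"], [lab])
math_faculty = FirewallNode("Math faculty", ["math-1"], [cs])
root = FirewallNode("University of Waterloo", ["uw-1", "uw-2"], [math_faculty])

for firewall, rule in path_rule_set(root, "Network Lab"):
    print(firewall, rule)
```

The resulting list can then be handed to the single-firewall analysis unchanged, since it is just an ordered rule set.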


A Case Study 


The admin wants to investigate the distributed policy
configuration applied to traffic to the Network Lab. He
first changes the view to the Distributed Firewalls view
and expands the tree to get to the Network Lab node. As
shown in Figure 14, PolicyVis creates a new rule set
containing all rule sets from the firewalls on the path, in
this order: University of Waterloo, Math faculty, CS
department, and Network Lab.

Figure 12: Redundancy anomaly between rule 2 and rule 13.

Figure 13: There is no anomaly in this case. (Axes: Source
Port, Source IP; Destination IP: 161.120.33.45.)


After building up the rule set for the path from 
University of Waterloo to Network Lab, PolicyVis 
allows the admin to start visualizing the path policy. In 
this visualization, the admin chooses to investigate all 
rules on this firewall path that control traffic to any des- 
tination address in the university network by choosing 
the scope Destination Address with value 161.*.*.*. 


In this case, multiple subnets are normally involved
because multiple firewalls are considered at once.
PolicyVis not only lets the admin visualize all the subnets
at the same time, but also supports a separate view of each
subnet, and the admin can switch between subnet views
easily. In this example, there are six subnets whose
traffic is controlled by the firewalls on the path, and the
Network Lab subnet 161.120.33.* is currently being viewed
and analyzed by the admin (Figure 15). The admin can change
the view to a different subnet by clicking on the Next or
Previous button.


It is easy to see that, while the single firewall placed at
the Network Lab subnet (Figure 16), which only controls
traffic to 161.120.33.*, does not contain any anomaly, the
distributed firewalls (Figure 15) appear to have anomalies
(overlapping areas). In fact, there is a shadowing anomaly
in this case between a rule in the University of Waterloo
firewall and a rule in the Network Lab firewall.


Implementation and Evaluation 


We implemented PolicyVis in Java, using JFreeChart [7], a
free open-source Java chart library, to make it easy to
display charts in the graph. We also used BuDDy [17] for
the BDD representation of firewall policies.


In this section, we present our evaluation study of the
usability and efficiency of PolicyVis. To assess the
practical value of PolicyVis, we not only created firewall
policies randomly (with and without rule anomalies), but
also used real firewall rules from our university for the
evaluation study. Each firewall used in the evaluation test
has from 30 to 45 rules. We then asked 11 people (with
varying levels of expertise in the field) to use both
PolicyVis and raw firewall rules to find some specific
firewall properties (such as what traffic is allowed to a
chosen domain, or which machines accept web traffic) and to
locate rule anomalies in the firewalls. We recorded the
time each person took to complete each task with each
method and computed the average time over all participants.

Figure 14: An example of distributed firewalls.

Task \ Method PolicyVis Raw firewall rules
Find firewall properties 10.44 minutes
Find firewall anomalies 1.98 minutes 12.78 minutes

Table 3: Average estimated time to achieve each task by using each method.


The participants in this evaluation became familiar with
PolicyVis very quickly and were very confident with the
features it supports. As shown in Table 3, the average time
to complete each task with PolicyVis is much shorter than
with raw firewall rules, especially for finding firewall
anomalies. This evaluation demonstrated that PolicyVis is a
user-friendly tool with high usability and efficiency.


Conclusion and Future Work 


Firewalls provide proper security services if they are
correctly configured and efficiently managed. Firewall
policies used in enterprise networks are getting more
complex as the number of firewall rules and devices grows.
As a result, there is a high demand for an effective policy
management tool that significantly helps users discover the
properties of a firewall policy and find rule anomalies in
both single and distributed firewalls.


PolicyVis, presented in this paper, provides visual views
of firewall policies and rules, which gives users a
powerful means of inspecting firewall policies. In this
paper, we described the design features of the PolicyVis
tool and illustrated PolicyVis with multiple examples
showing its effectiveness and usefulness in determining the
policy behavior in various case studies. We presented
concepts and techniques for finding rule anomalies in
PolicyVis. We also showed how PolicyVis visualizes
distributed firewalls to achieve the same benefits as
visualizing single firewalls. Finally, we presented the
implementation and evaluation of PolicyVis.


PolicyVis was shown to be very effective for firewalls in
real-life networks. With regard to usability, unskilled
people can, after a short time learning how to use
PolicyVis, quickly understand and start using all of its
features. Moreover, in our evaluation, PolicyVis
effectively helped users, including both junior and senior
network security staff, to figure out firewall policy
behavior easily by reviewing the visualization of primitive
firewall rules. In addition, PolicyVis was shown to be a
very good tool for finding rule anomalies or conflicts
easily and quickly. The number of dimensions users need to
consider during firewall inspection varies with the
situation; however, considering all possible rule fields
always gives users a better analysis of the policy.

Even though PolicyVis was shown to be a very effective
tool, we still want to evaluate it further and collect more
ideas from users to make PolicyVis the best visualization
tool for firewall policies. There are still many possible
features that we want to implement in PolicyVis to maximize
its usability as well as its efficiency. We want PolicyVis
to support more viewing levels of firewall policies and to
automatically show users possible strange behaviors and
true rule anomalies of firewall policies on the graph. In
addition, PolicyVis currently visualizes stateless
firewalls, but we can easily envision extending it to
visualize stateful firewalls too, by preprocessing the
policy to create the implicit rules in stateful firewalls.
We will consider supporting stateful firewalls in PolicyVis
in our future work.

ABSTRACT


Packet filtering firewall is one of the most important mechanisms used by corporations to en- 
force their security policy. Recent years have seen a lot of research in the area of firewall manage- 
ment. Typically, firewalls use a large number of low-level filtering rules which are configured us- 
ing vendor-specific tools. System administrators start off by writing rules which implement the se- 
curity policy of the organization. They add/delete/change order of rules as the requirements 
change. For example, when a new machine is added to the network, new rules might be added to 
the firewall to enable certain services to/from that machine. Making such changes to the low-level 
rules is complicated by the fact that the effect of a rule is dependent on its priority (usually deter- 
mined by the position of the rule in the rule set). As the size and complexity of a rule set increases, 
it becomes difficult to understand the impact of a rule on the rule set. This makes management of 
rule sets more error prone. This is a very serious problem as errors in firewall configuration mean 
that the desired security policy is not enforced. 


Previous research in this area has focused on either building tools that generate low-level 
firewall rules from a given security policy or finding anomalies in the rules, i.e., verifying that the 
rules implement the given security policy correctly. We propose a technique that aims to infer the 
high-level security policy from low-level representation. The first step in our approach is that of 
generating flattened rules, i.e., rules without priorities, which are equivalent to the given firewall 
rule set. Removal of priorities from a rule set enables us to merge a number of rules that have a 
similar effect. Our rule merging algorithm reduces the size and complexity of the rule set signifi- 
cantly by grouping the services, hosts, and protocols present in these rules into various (possibly 
overlapping) classes. We have built a prototype implementation¹ of our approach for iptables firewall
rules. Our preliminary experiments indicate that the technique infers security policy that is at
a sufficiently high level of abstraction to make it understandable and debuggable.


Introduction 


Firewalls are the first line of defense for protect- 
ing corporate networks. System administrators use 
packet filtering firewalls as one of the mechanisms to 
implement the security policy of an enterprise. These 
firewalls are configured using rules that specify match- 
ing criteria, and the action to be performed when a 
packet matches each rule. These rules are matched se- 
quentially against all packets passing through the fire- 
wall. These rules can be conflicting, i.e., multiple rules 
with different actions can match a packet. In such a 
case, the priority of the rules in the rule set determines 
the action to be performed. Typically, firewalls use a 
first match policy, i.e., the action corresponding to the 
first matched rule is taken irrespective of the other 
rules that can match the packet. Thus the order of rules 
in a firewall rule set defines a priority relation over the 
rules. Understanding the effect of firewall rules on 
network traffic is complicated by this priority relation 
between the rules. 
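
The first match policy, and the way it resolves conflicting rules by their order, can be illustrated with a short sketch. The packet representation, the two sample rules, and the default action below are hypothetical and chosen only for the example; they are not taken from any rule set in this paper.

```python
def first_match(rules, packet, default="REJECT"):
    """Return the action of the first rule whose predicate matches `packet`."""
    for matches, action in rules:
        if matches(packet):
            return action
    return default  # assumed default policy when no rule matches


def to_telnet(packet):
    return packet["dport"] == 23


def from_admin_host(packet):
    return packet["src"] == "10.0.0.5"


# Two conflicting rules: both match the packet, with different actions.
rules = [(to_telnet, "REJECT"), (from_admin_host, "ACCEPT")]
packet = {"src": "10.0.0.5", "dport": 23}

print(first_match(rules, packet))                  # REJECT
print(first_match(list(reversed(rules)), packet))  # ACCEPT
```

Swapping the two rules changes the decision for the same packet, which is exactly the priority relation that makes prioritized rule sets hard to read.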

System administrators initially configure the firewalls
with rules that implement the security policy of the
organization. As the requirements of the enterprise change,
rules are added to or deleted from the rule set without
refactoring. Over time, the rule set comes to contain many
rules which are very similar, and the mapping between the
security policy and the rules becomes unclear. Managing
such large rule sets becomes increasingly difficult,
leading to configuration errors, which are a serious
security concern [10]. Hence, firewall management tools
become necessary to help system administrators.


Many tools for firewall management (e.g., Fir- 
mato [2], Firestarter [3], Shorewall [4]) focus on gen- 
erating low-level rules from high-level policy lan- 
guage (or GUI). Recent years have seen many works 
[6, 13, 1] which try to discover configuration errors in 
the firewalls. But tools which aid in understanding ex- 
isting firewall rule sets are missing from the arsenal of 
system administrators. Some tools (e.g., ITVal [8, 9],
Fang [7]) provide a way of querying whether certain 
packets will be allowed through the firewall. 

The problem with such tools is that the administrator has
to know what to query for. Tools like Lumeta Firewall
Analyzer [12] try to avoid this problem by automating the
task of querying the firewalls. Lumeta Firewall Analyzer
queries the firewall for all possible packets that are
allowed to pass. For medium to large rule sets this results
in a large amount of data being generated. Analyzing such
large amounts of data presents another challenge to the
system administrator.

¹This research is supported by NSF grants 0208877 and 0627687.


We present a novel way to address this problem in this
paper. Our approach of inferring the high-level policy from
low-level packet filtering rules presents the information
to the user in a compact format. Figure 1 shows 24 iptables
rules taken from a larger firewall rule set (65 rules)
being used for a network within our department.² It is
quite difficult to understand what kind of traffic is
allowed through the firewall by looking at the script. A
new system administrator who is assigned to manage a
firewall rule set like this needs to understand the
security policy so that she can answer questions such as:

• Which services are allowed on each host?
• Which hosts are allowed to communicate with each other?
• What protocol is valid between communicating hosts?

Figure 2 shows the 10-rule policy generated by our
technique for the same rule set. Clearly, it is easier for
a system administrator to understand this policy than the
rule set in Figure 1. Moreover, the high-level policy can
reveal opportunities for refactoring the low-level rules.


System administrators have an intuitive notion of whether a
policy is “complicated” or “simple.” The complexity of a
policy depends not just on the number of rules in the
policy but also on how complicated those rules are. In this
work, we define a metric for the complexity of a rule
set/policy that captures this notion. This allows us to
compare different representations of the same rule sets.
Our technique infers policies with low complexity, and
hence these policies are easier to understand.

²We have modified the IP addresses due to privacy concerns.

³Rules in the flattened rule set and the high-level policy
are labeled with letters to emphasize the fact that these
rules do not have any priority relation defined over them.


Our objective was to develop a technique to infer 
firewall policies that would help the system adminis- 
trators to work at a higher level of abstraction. Our 
technique can be combined with existing techniques to 
form a comprehensive firewall management toolkit. 
The benefits of such a toolkit are clear from the fol- 
lowing scenario: a system administrator who needs to 
modify some existing legacy firewall rule set can ex- 
tract the security policy from the rule set using our 
technique. She can then make changes to the high-lev- 
el policy and use an automated tool to generate the 
low-level rules. 


Since our technique uses decision tree like graphs 
(explained in Section Priority Elimination Phase) to 
represent the firewall rules, it is easy to enhance our 
system to provide querying facility. Moreover, our sys- 
tem automatically removes redundant rules from the 
policy. Hence, it is a trivial task to identify such redun- 
dant rules in the input rule set using our technique. 


We first present an overview of our approach. The next two
sections provide the details of the components in our
system. We then discuss related work, followed by
concluding remarks in the final section.


Approach Overview 


Our approach for inferring policy consists of two phases.
First, in the priority elimination phase, we convert the
low-level rule set, which contains rules with priorities,
to an equivalent rule set that contains rules with no
priority relation defined over them. We call the generated
rules flattened rules. The flattened rule set does not
contain any overlapping rules, i.e., there is one and only
one flattened rule that can match a given packet. This
simplifies the process of inferring policies from rule
sets. Unlike the original rules, flattened rules can be
arbitrarily reordered without modifying their overall
effect. This enables us to reorder and merge similar rules
together, thereby reducing the size and complexity of the
generated policy.

parameter-problem -j ACCEPT
source-quench -j ACCEPT

1. IPTABLES -A FORWARD -p tcp -d 192.168.1
2. IPTABLES -A FORWARD -p tcp -d 192.168.1
3. IPTABLES -A FORWARD -p tcp -d 192.168.1.
4. IPTABLES -A FORWARD -p tcp -d 192.168.1
5. IPTABLES -A FORWARD -p tcp -d 192.168.1
6. IPTABLES -A FORWARD -p tcp -d 192.168.1
7. IPTABLES -A FORWARD -p tcp -d 192.168.1
8. IPTABLES -A FORWARD -s 192.168.1.126/25
9. IPTABLES -A FORWARD -s 192.168.1.126/25
10. IPTABLES -A FORWARD -s 192.168.1.126/25
11. IPTABLES -A FORWARD -s 192.168.1.126/25
12. IPTABLES -A FORWARD -p tcp -d 192.168.1.
13. IPTABLES -A FORWARD -s 192.168.1.254/28
14. IPTABLES -A FORWARD -s 192.168.1.236 -p
15. IPTABLES -A FORWARD -s 192.168.1.254/28
16. IPTABLES -A FORWARD -s 192.168.1.254/28
17. IPTABLES -A FORWARD -p udp -d 192.168.1.
18. IPTABLES -A FORWARD -p udp -d 192.168.1.
19. IPTABLES -A FORWARD -s 192.168.1.254/28
20. IPTABLES -A FORWARD -s 192.168.1.236 -p
21. IPTABLES -A FORWARD -d 192.168.1.126/25
22. IPTABLES -A FORWARD -d 192.168.1.126/25
23. IPTABLES -A FORWARD -d 192.168.1.126/25
24. IPTABLES -A FORWARD -j REJECT

Figure 1: Sample iptables script.
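
The effect of priority elimination can be illustrated on a toy packet space small enough to enumerate. The sketch below is an assumption-laden illustration rather than the technique of this paper, which works on decision-tree-like graphs: each rule is modeled as an explicit set of packets, and the hosts, ports, and function names are hypothetical.

```python
from itertools import product


def first_match(rules, packet, default="REJECT"):
    """Evaluate a rule list in order; `default` is an assumed default policy."""
    for match_set, action in rules:
        if packet in match_set:
            return action
    return default


def flatten(rules):
    """Turn a priority-ordered rule list into pairwise disjoint rules."""
    covered = set()   # packets already matched by a higher-priority rule
    flat = []
    for match_set, action in rules:
        effective = set(match_set) - covered
        if effective:  # a fully covered rule can never fire, so drop it
            flat.append((frozenset(effective), action))
        covered |= set(match_set)
    return flat


# Toy packet space: (destination host, destination port) pairs.
hosts, ports = ("a", "b"), (22, 25, 80)
universe = set(product(hosts, ports))

prioritized = [
    ({p for p in universe if p[1] == 22}, "REJECT"),   # rule 1: port 22
    ({p for p in universe if p[0] == "a"}, "ACCEPT"),  # rule 2: host a
]
flat = flatten(prioritized)

# The flattened rules give the same decisions in either order.
for packet in universe:
    expected = first_match(prioritized, packet)
    assert first_match(flat, packet) == expected
    assert first_match(flat[::-1], packet) == expected
```

Because each packet is matched by at most one flattened rule, the order in which the flattened rules are listed no longer matters, which is what makes later merging possible.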


The problem with priority elimination is that it generates
a large number of rules. In the policy inference phase, we
reduce the number of rules by grouping hosts, services, and
protocols into (possibly overlapping) classes and merging
rules that contain the same class of objects. It is not
sufficient to produce rule sets with a small number of
rules, as the complexity of the generated rules also
affects the complexity of the entire rule set. Arbitrary
merging of rules can lead to rule sets which are very
complicated. So this phase tries to merge the rules such
that the complexity of the inferred policy is minimized.
Finally, the inferred policy is presented to the user.
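
To give a feel for what merging flattened rules means, the following sketch merges rules that agree on every field except one. It is only an illustration under assumptions and is not the merging algorithm of this paper: it is greedy, it ignores the complexity metric, and the dictionary layout and the function name merge_on are hypothetical. The sample rules are modeled on entries of the inferred policy in Figure 2.

```python
from collections import defaultdict


def merge_on(rules, field):
    """Merge rules that are identical in every field except `field`."""
    groups = defaultdict(set)
    for rule in rules:
        # Key: all other fields of the rule, in a fixed (sorted) order.
        key = tuple(sorted((k, v) for k, v in rule.items() if k != field))
        groups[key].add(rule[field])
    merged = []
    for key, values in groups.items():
        rule = dict(key)
        rule[field] = tuple(sorted(values))  # the class of merged values
        merged.append(rule)
    return merged


# Flattened accept rules (no priorities), illustrative only.
flat_rules = [
    {"proto": "tcp", "dst": "192.168.1.252", "service": "www"},
    {"proto": "tcp", "dst": "192.168.1.252", "service": "https"},
    {"proto": "tcp", "dst": "192.168.1.250", "service": "domain"},
    {"proto": "udp", "dst": "192.168.1.250", "service": "domain"},
]

# First group services, then group protocols.
policy = merge_on(merge_on(flat_rules, "service"), "proto")
for rule in policy:
    print(rule)
```

The four rules collapse into two, one for tcp to 192.168.1.252 with services www and https, and one for tcp and udp to 192.168.1.250 with service domain. A different merge order can give a different result, which is why the choice of merges has to be guided by a complexity measure.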


Background 


For concreteness, we describe the details of iptables.
Almost every packet filtering firewall relies on the type
of rules used in iptables. Iptables [14] is the user-space
command-line program used to configure the rule set in the
netfilter framework in Linux 2.4.x and 2.6.x. The netfilter
framework enables packet filtering, network address (and
port) translation (NA[P]T), and other packet mangling.
Iptables can be used to configure three independent tables
within the kernel: filter, nat, and mangle. The filter
table is used to set up rules for filtering packets, the
nat table is consulted when a packet that creates a new
connection is encountered, and the mangle table is used for
specialized packet alteration.


In this paper, we are concerned only with the filter table,
which is used as a packet filtering firewall. The filter
table consists of ordered lists of rules that are called
chains. The order of rules in a chain determines their
priority. There are three built-in chains in the filter
table. The INPUT chain is used to filter packets that are
destined for the host on which the firewall is running. The
OUTPUT chain is used to filter packets generated by the
firewall host. The FORWARD chain is used to filter packets
forwarded by the firewall host to other hosts in the
network. One powerful feature of iptables is that it allows
the user to define new chains in addition to the built-in
ones. This allows administrators to group rules which
together provide a certain high-level function, like
protecting a subnet.


Allow only the following packets:

tcp TO 192.168.1.126/25 FOR auth
tcp, udp TO 192.168.1.250 FOR domain
tcp, udp TO 192.168.1.251 FOR smtp
tcp TO 192.168.1.252 FOR www, https
tcp TO 192.168.1.251 FOR smtps, imaps, pop3s


An iptables rule consists of matching criteria and a
target. The target specifies the action to be taken when a
packet satisfies the matching criteria. Matching criteria
are specified in terms of tests on packet header fields
such as destination IP address (dhost), source IP address
(shost), destination port (dport), source port (sport), and
protocol (proto). The target can be any of the following:
ACCEPT, QUEUE, REJECT, DROP, LOG, or the name of a
user-defined chain.


The target ACCEPT means that the packet is allowed to pass
through the firewall. QUEUE passes the packet on to user
space. For our purposes, QUEUE is semantically similar to
ACCEPT, as the packet is allowed to reach its destination.
So we treat QUEUE just like ACCEPT and omit it from our
discussion. DROP and REJECT mean that the packet is denied.
REJECT returns an ICMP error packet to the sender, while
DROP denies the packet without giving any error indication.
The target LOG, on the other hand, just makes an entry in
the log file when a matching packet arrives. By specifying
user-defined chains as targets, conditional call/return
semantics can be added to the firewall rule set.


Figure 3 shows a sample iptables rule set. All the 
rules considered are for the FORWARD chain with a 
default policy of REJECT. Rule 1 specifies that all
hosts from the network 192.168.1.0/24 are allowed to 
connect to the host 120.240.18.1 using SMTP. Rule 2 
specifies that host 120.240.18.1 can connect to the net- 
work 120.240.20.0/24 using SMTP. This can be a real 
world scenario where 192.168.1.0/24 is an internal 
network of an organization, 120.240.20.0/24 is exter- 
nal network and 120.240.18.1 is the SMTP server for 
that organization. The rule set says that SMTP server 
can send SMTP traffic to external network and inter- 
nal hosts can send SMTP traffic to SMTP server but 
not to the external network. 
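
The behavior described for this rule set can be checked with a short sketch built on Python's standard ipaddress module. The tuple encoding of the two rules is an assumption made for illustration (Figure 3 itself is not reproduced here, and the protocol test is omitted); SMTP is taken to be destination port 25, and the default REJECT policy is the one stated above.

```python
import ipaddress

SMTP_PORT = 25

# (source network, destination network, destination port, action),
# listed in priority order. Hypothetical encoding of the two rules.
RULES = [
    (ipaddress.ip_network("192.168.1.0/24"),
     ipaddress.ip_network("120.240.18.1/32"), SMTP_PORT, "ACCEPT"),
    (ipaddress.ip_network("120.240.18.1/32"),
     ipaddress.ip_network("120.240.20.0/24"), SMTP_PORT, "ACCEPT"),
]


def forward_decision(src, dst, dport, rules=RULES, default="REJECT"):
    """First-match decision for a forwarded packet."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for src_net, dst_net, port, action in rules:
        if src_ip in src_net and dst_ip in dst_net and dport == port:
            return action
    return default  # default policy of the FORWARD chain


# Internal host to the SMTP server, server to the external network,
# and internal host directly to the external network.
print(forward_decision("192.168.1.7", "120.240.18.1", SMTP_PORT))
print(forward_decision("120.240.18.1", "120.240.20.9", SMTP_PORT))
print(forward_decision("192.168.1.7", "120.240.20.9", SMTP_PORT))
```

The first two packets are accepted and the third falls through to the default policy, matching the description that internal hosts can reach the SMTP server but not the external network.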


We represent the iptables rules in tabular format 
for ease of understanding. Table 1 is the tabular repre- 
sentation of the rules in Figure 3. We list all rules in a 
table in the order of their priority. The columns indi- 
cate the packet fields being tested. A rule is represent- 
ed as a row with values for the packet fields being 


tcp, udp FROM 192.168.1.254/28 TO 192.168.1.11 FOR sunrpc 

udp FROM 192.168.1.254/28 TO 192.168.1.11 FOR nfs, ports [4000-4002] 
tcp FROM 192.168.1.126/25 TO [192.168.1.13 - 15], 192.168.1.20 FOR ssh 
tcp, udp FROM 192.168.1.236 TO 192.168.1.35 FOR ipp 


icmp TO 192.168.1.126/25 OF TYPES destination-unreachable, parameter-problem, source-quench 


Figure 2: Higher level policy for rules in Figure 1. 
tested filled in the respective columns. If a rule does not
contain any test on a particular field, then that column has a
wild-card character "*". A "*" for a field indicates that any
value of the field will match this rule. The action associated
with a rule is shown in the last column. Note that we omit
many fields like icmp-type from examples to avoid clutter.


Even though our technique can be applied to 
chains with default ACCEPT policy, all examples and 
discussions assume that the chains have default RE- 
JECT policy. Note that the chain name is shown above 
the rules in the tabular format to make the tables more 
understandable. In practice, we generate different rule 
sets for different built-in chains. 


Priority Elimination Phase 


In this phase we take the rule set with priorities and
generate a flattened rule set. A flattened rule set contains
rules which have no priority relation, so they can be
arbitrarily reordered and merged to generate a compact policy.
The idea behind flattening of rules is simple. Consider a rule
set RS with rules $R_i$ such that the priority of $R_i$ is
higher than the priority of $R_j$ iff $i < j$. The semantics
of such a prioritized rule set are that a packet is matched by
a rule $R_i$ iff it satisfies the matching criteria of $R_i$
and doesn't satisfy the matching criteria of any of the rules
$R_j$ such that $j < i$. In other words, a packet can match a
rule only if it is not matched by any higher priority rule.
For example, $R_3$ can match a packet only if $R_1$ and $R_2$
do not match it. Thus a prioritized rule set can be converted
to a flattened rule set by replacing $R_j$ by
$\neg R_1 \wedge \ldots \wedge \neg R_{j-1} \wedge R_j$,
$2 \le j \le n$, where $n$ is the total number of rules in the
set. In our example, $R_1$ will not be modified while $R_2$
will be replaced by $\neg R_1 \wedge R_2$ and $R_3$ by
$\neg R_1 \wedge \neg R_2 \wedge R_3$.
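As a small illustration of this replacement, the sketch below models each rule simply as the set of packets it matches, over a toy universe in which packets are integers. The function name flatten and the sets R1, R2, R3 are assumptions made for the example. With explicit sets the negations are cheap; the blowup arises only when rules are kept as tests on header fields.

```python
def flatten(match_sets):
    """Replace each R_j by (not R_1) and ... and (not R_{j-1}) and R_j.

    match_sets lists the packets matched by each rule, highest priority first.
    """
    flattened, seen = [], set()
    for matches in match_sets:
        flattened.append(matches - seen)  # drop packets taken by earlier rules
        seen |= matches
    return flattened


# Three overlapping rules over a toy universe of packets 1..6.
R1, R2, R3 = {1, 2, 3}, {2, 3, 4, 5}, {1, 5, 6}
flat = flatten([R1, R2, R3])
```

After flattening, no packet is matched by more than one rule, so the rules can be reordered freely without changing which rule a packet hits.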

The problem with the naive way of generating flattened rules
is that it can lead to an exponential number of rules. In
[11], we developed a way of creating a directed acyclic graph
(DAG), called a packet classification automaton, that avoids
this exponential blowup. We use the packet classification
automaton to generate the flattened rule set. Here we describe
the characteristics of the automaton without going into the
details of the construction algorithm, which can be found in
[11].
• Each node (except the final nodes) is annotated with a
  packet header field. This denotes that the node performs a
  test on that field.
• The leaf nodes correspond to the action to be performed when
  a packet is matched by a rule. For example, for iptables the
  leaf nodes correspond to the higher level actions allow and
  deny. We map targets like ACCEPT, QUEUE, and LOG to allow,
  and REJECT and DROP to deny.
• A path from the root to a leaf represents the tests to be
  performed on a packet to match it against the rule set and
  the action to be taken on the packet.
• At each node the outgoing edges are labeled with the
  different values specified for the packet header field
  specified on the node.
• At each node there is an additional outgoing edge called the
  "else" edge. This edge is taken when a packet has a field
  value different from any of the values listed on the other
  outgoing edges from that node.
• Nodes at the same height in different subgraphs can have
  tests for different fields.


This automaton has the following interesting 
properties: 

• Property 1: The packet classification automaton is
  equivalent to the prioritized rule set, i.e., any packet
  that is allowed/denied by the rule set has a path from root
  to the allow/deny node in the automaton and vice versa.

• Property 2: If the input rule set is comprehensive, i.e.,
  for every packet there is a rule in the rule set that
  matches it, then the automaton has a path from root to a
  leaf for every packet. Moreover, the path from root to leaf
  is unique.


We generate the flattened rule set by considering all paths
from the root to the leaves in the graph. Each path
corresponds to a rule in the flattened rule set. Consider an
iptables script with the three rules for the FORWARD chain
shown in Listing 1.


1. IPTABLES -A FORWARD -s 192.168.1.0/24 -d 120.240.18.1 --dport 25 -j ACCEPT
2. IPTABLES -A FORWARD -s 120.240.18.1 -d 120.240.20.0/24 --dport 25 -j ACCEPT
3. IPTABLES -A FORWARD -j REJECT


Figure 3: iptables rule set 1. 





FORWARD (Default: Reject)

| # | shost          | sport | dhost           | dport | target |
| 1 | 192.168.1.0/24 | *     | 120.240.18.1    | 25    | ACCEPT |
| 2 | 120.240.18.1   | *     | 120.240.20.0/24 | 25    | ACCEPT |





Table 1: iptables rules in Figure 3 represented in tabular format. 


-d 192.168.1.1 -s 192.168.1.3 --dport 22 -j ACCEPT 
-d 192.168.1.2 -s 192.168.1.4 --dport 22 -j ACCEPT 


-j REJECT 


Listing 1: Sample iptables rule set as input. 
Figure 6 shows the packet classification automaton for this
sample rule set. Here each node is labeled with the packet
field being tested at that node.
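To make the structure concrete, the following sketch hand-builds one plausible automaton for the rules in Listing 1 and classifies packets against it. The dictionary layout, the helper node, and the field order (dhost, then shost, then dport) are assumptions made for illustration; the construction algorithm of [11] is not reproduced here.

```python
ALLOW, DENY = "allow", "deny"


def node(field, edges):
    """A non-leaf node: the field it tests and its labeled outgoing edges."""
    return {"field": field, "edges": edges}


# Hypothetical automaton for Listing 1; every node has an "else" edge.
AUTOMATON = node("dhost", {
    "192.168.1.1": node("shost", {
        "192.168.1.3": node("dport", {22: ALLOW, "else": DENY}),
        "else": DENY,
    }),
    "192.168.1.2": node("shost", {
        "192.168.1.4": node("dport", {22: ALLOW, "else": DENY}),
        "else": DENY,
    }),
    "else": DENY,
})


def classify(packet, state=AUTOMATON):
    """Follow one edge per tested field until a leaf (allow/deny) is reached."""
    while isinstance(state, dict):
        value = packet[state["field"]]
        edges = state["edges"]
        state = edges[value] if value in edges else edges["else"]
    return state
```

Because every node carries an else edge, every packet reaches exactly one leaf, which is the behaviour Property 2 describes for a comprehensive rule set.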


Pruning 


Figure 4 shows the packet classification automaton for a
firewall rule set with 65 rules for a small network within our
department. The bottom row has two leaf nodes corresponding to
the actions allow and deny. We can see that even for a small
rule set, there are a large number of paths in the graph.


We prune this graph to reduce the number of paths that we
have to consider. The following are the steps that we perform
for pruning this graph.

1. We remove the deny node and all incoming edges to it. If
this results in an intermediate node becoming a leaf node,
then we remove that node and its incoming edges. We
recursively do this for all ancestors of deny except the root
node. Figure 7 shows the graph after the deny node has been
removed from the graph in Figure 6. Now the graph contains
only paths from root to allow. This means that the flattened
rule set that we







Figure 4: Initial graph for network in the department. 
generate from this graph is no longer comprehensive. But this
problem can be easily solved by having a default reject policy
for the flattened rules, i.e., packets that are not matched by
any rule in the flattened rule set are discarded.

2. We do a bottom-up traversal of the graph and merge
equivalent states. We consider two states r and s as
equivalent if they have transitions to the same state $t_i$ on
label $l_i$ for all outgoing edges. For the graph in Figure 7,
both the dport nodes have an edge to allow for value 22. These
nodes are merged to get a graph as shown in Figure 8.

3. The previous two steps create new opportunities for
reducing the number of paths through the graph. We can now
merge multiple edges which connect the same nodes. In the case
where the edge merging involves the else edge, the merged edge
is labeled with else. A special case of this is when all
outgoing edges from a node can be merged with the else edge.
In this case the node with the merged outgoing edges is
removed, as the merged edge indicates that this








Figure 5: Pruned graph for graph in Figure 4. 
transition is taken irrespective of the value of the field in
the removed node. This edge merging is also done in a
bottom-up fashion.







Figure 6: Unpruned graph for sample rules. 


Figure 5 shows the pruned graph corresponding to the graph in
Figure 4. We can read off all the paths from the root to allow
to get the flattened rule set.
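The three pruning steps can be sketched as a single bottom-up pass. This is an illustrative simplification rather than our implementation: an input node is assumed to be a pair (field, {label: child}), a pruned node is returned as an immutable pair (field, frozenset of (label, child)), and equivalent states are merged implicitly because structurally equal tuples compare equal. The names prune and paths are hypothetical.

```python
ALLOW, DENY = "allow", "deny"


def prune(node):
    """Return the pruned form of node, or None if it cannot reach allow."""
    if node == ALLOW:
        return ALLOW
    if node == DENY:
        return None                       # step 1: drop deny and edges into it
    field, edges = node
    labels_by_child = {}
    for label, child in edges.items():
        kept = prune(child)               # children first: bottom-up traversal
        if kept is not None:
            labels_by_child.setdefault(kept, set()).add(label)
    if not labels_by_child:
        return None                       # step 1: node became a dead leaf
    merged = set()
    for child, labels in labels_by_child.items():
        # step 3: edges to the same child collapse; else absorbs the others
        label = "else" if "else" in labels else frozenset(labels)
        merged.add((label, child))
    if len(merged) == 1:
        label, child = next(iter(merged))
        if label == "else":
            return child                  # step 3: the test is irrelevant
    return (field, frozenset(merged))     # step 2: equal tuples are one state


def paths(node, prefix=()):
    """Yield every root-to-allow path as a tuple of (field, label) tests."""
    if node == ALLOW:
        yield prefix
        return
    field, edges = node
    for label, child in edges:
        yield from paths(child, prefix + ((field, label),))


def dport():
    return ("dport", {22: ALLOW, "else": DENY})


root = ("dhost", {
    "192.168.1.1": ("shost", {"192.168.1.3": dport(), "else": DENY}),
    "192.168.1.2": ("shost", {"192.168.1.4": dport(), "else": DENY}),
    "else": DENY,
})
pruned = prune(root)
flattened = list(paths(pruned))
```

For the sample rules this leaves two root-to-allow paths that share a single dport node, and each path reads off as one flattened rule.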


Policy Inference Phase 


As the main goal of our research was to come up 
with a compact representation of the rules, we asked 
ourselves the following questions: 

• How can we compare two representations of the same rule
  set?

• How can we reduce the complexity of the flattened rules?


In this section we present our answers to these 
questions. 


Complexity 


We call a particular representation of the rules a policy.
Intuitively, a policy that requires a bigger description is
more complex. For example, consider an organization which has
192.168.1.0/24 as the internal network. It has rules that
allow auth packets to all hosts in the network and ssh packets
only to the hosts 192.168.1.5, 192.168.1.6, and 192.168.1.7.
These rules can be represented as Policy 1, as shown in
Listing 2. A more compact way of representing these rules
(Policy 2) is shown in Listing 3.

We capture this notion of how complicated a pol- 
icy is by the following definition: 


Accept packets 


Tongaonkar, Inamdar, & Sekar 


Definition 3: Complexity of a policy
• A data item is any value that is used in a rule.
• The complexity of a rule is the number of data items present
  in the rule.
• The complexity of a policy is the sum of the complexity of
  all rules in the policy.







Figure 7: Graph with deny node removed from the 
graph in Figure 6. 


Data items refer to the different values of packet fields
present in the policy. For example, in Policy 2 the data items
are 192.168.1.0/24, [192.168.1.5 - 7], auth, and ssh. For
Policy 1 the data items are [192.168.1.0 - 4],
[192.168.1.8 - 255], [192.168.1.5 - 7], auth, and ssh. The
complexity of the rules in Policy 1 is 3 for rule a and 3 for
rule b. We note that ranges such as [192.168.1.5 - 7] are
treated as a single data item and hence contribute 1 to the
complexity of a rule. Even though our examples contain "*" for
certain fields, they are there just to increase readability.
Our final policy (as shown in Figure 2) does not contain any
"*". Therefore, "*" doesn't contribute anything to the
complexity of a rule. Similarly, in Policy 2 the complexity of
rules a and b is 2 each. The complexity of Policy 2 is 4. This
is less than that of Policy 1, which is 6. Hence we can say
that Policy 2 is a more compact representation than Policy 1.
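The arithmetic above can be checked with a short sketch. The dictionary representation of a rule and the names rule_complexity, policy_complexity, POLICY_1, and POLICY_2 are assumptions made for illustration; a wildcard field is modelled simply by leaving the field out.

```python
def rule_complexity(rule):
    """Number of data items in a rule; wildcard fields are simply absent."""
    return sum(len(items) for items in rule.values())


def policy_complexity(policy):
    """Sum of the complexities of all rules in the policy."""
    return sum(rule_complexity(rule) for rule in policy)


# Policy 1 (Listing 2): each range counts as a single data item.
POLICY_1 = [
    {"dhost": ["[192.168.1.0 - 4]", "[192.168.1.8 - 255]"], "service": ["auth"]},
    {"dhost": ["[192.168.1.5 - 7]"], "service": ["auth", "ssh"]},
]

# Policy 2 (Listing 3): the more compact representation.
POLICY_2 = [
    {"dhost": ["192.168.1.0/24"], "service": ["auth"]},
    {"dhost": ["[192.168.1.5 - 7]"], "service": ["ssh"]},
]
```

Under these assumptions policy_complexity gives 6 for Policy 1 and 4 for Policy 2, matching the figures in the text.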


Problem Statement 


We describe the problem statement in this section. At the end
of the priority elimination phase we have

a. TO [192.168.1.0 - 4], [192.168.1.8 - 255] FOR auth 


b. TO [192.168.1.5 - 7] FOR auth, ssh 


Listing 2: Policy 1. 


Accept packets 
a. TO 192.168.1.0/24 FOR auth 
b. TO [192.168.1.5 - 7] FOR ssh 


Listing 3: Policy 2 — More compact version of policy in Listing 2. 
a large number of rules. We want to merge the rules in such a
way that the complexity of the generated rule set is minimum.
We can merge rules which have the same values for all but one
field by performing a union operation over the field which has
differing values in the rules. The main issue with merging
rules is that different subsets of rules can be merged on
different fields. So we need to select the subsets in such a
way that merging them leads to minimum complexity of the
entire rule set. This problem can be illustrated









Figure 8: dport node merged from graph in Figure 7. 


by considering the rule set shown in Table 2. We can merge
rules b and c on shost to get a new rule {bc}. Note that we
label a merged rule by concatenating the labels of the
original rules. We enclose the new label in braces to indicate
how the rules were merged.⁴ We can then merge {bc} with a on
dport to get rule {a{bc}}. This gives the following policy
(Policy 3), which has a complexity of 12; see Listing 4.


⁴ The notation {bc} is different from {b, c}, which means a
set containing the rules b and c.


Accept packets 
The complexity of each rule is written in parentheses beside
the rule. Since all the rules in the flattened rule set have
the same action, we can even generate overlapping rules to get
a more compact policy. We can merge rules a, b, and c as
before to get rule {a{bc}}, and also merge c and d on dhost,
followed by {cd} with e on dport. Listing 5 shows the policy
(Policy 4), with complexity 10, that is obtained by merging
the rules in this way.


The policy complexity definition is tied to the way the rules
are represented. Thus, the interesting question is: given a
rule set, can we find an equivalent representation which
enforces the same security policy but has lower complexity?
Our goal is to find a representation of the rule set that
implements the same policy and has minimum complexity. A
natural way to select the policy with minimum complexity is to
assign weights based on the complexity to all subsets of the
flattened rule set and select a minimum weight set cover. The
problem of determining the minimum weight set cover is
NP-complete. However, we have found that fairly simple
methods, based on exploiting the structure of the flattened
rules, yield good results for finding a policy with low
complexity.


Computing Weight of Subsets of Rules 


Computing the complexity of all subsets of a rule set is also
very hard. To see this, consider that we have a set of n
rules. We want to find a policy with minimum complexity for
this rule set. Finding the complexity of a set containing
n - 1 of these rules is similar to the original problem.


In practice, we can avoid this problem as we do not need to
generate all subsets of rules in the original rule set. To
understand this, consider an original rule set {a, b, c} such
that rules a and b can be merged on a certain field. Now the
weight of {a,b} is the complexity of the merged rule {ab}. But
the weight of {b,c} is the sum of the complexities of b and c.
So we do not need


{a{bc}}. 192.168.1.1, 192.168.1.10 TO 192.168.2.1 FOR http, smtp (5)
d. 192.168.1.10 TO 192.168.2.2 FOR smtp (3)
e. 192.168.1.10 TO 192.168.2.1, 192.168.2.2 FOR ssh (4)


Listing 4: Policy 3. 










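| # | shost                     | dhost                    | dport | target |
| a | 192.168.1.1, 192.168.1.10 | 192.168.2.1              | 80    | ACCEPT |
| b | 192.168.1.1               | 192.168.2.1              | 25    | ACCEPT |
| c | 192.168.1.10              | 192.168.2.1              | 25    | ACCEPT |
| d | 192.168.1.10              | 192.168.2.2              | 25    | ACCEPT |
| e | 192.168.1.10              | 192.168.2.1, 192.168.2.2 | 22    | ACCEPT |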






Table 2: Sample flattened rule set. 


Accept packets 


{a{bc}}. 192.168.1.1, 192.168.1.10 TO 192.168.2.1 FOR http, smtp (5)
{{cd}e}. 192.168.1.10 TO 192.168.2.1, 192.168.2.2 FOR smtp, ssh (5)


Listing 5: Policy 4. 
to consider the weight of {b,c} if we have the weights for
{b} and {c}. This leads us to conclude that we need to
consider only the subsets of the original set in which each
rule merges with some other rule or with a new merged rule. We
maintain the subsets of rules that we need to consider in a
working set.


Here we describe the algorithm shown in Figure 9 to generate
the working set. Initially we have a set R of flattened rules
$\{r_1, r_2, \ldots, r_n\}$. We want to generate a set W that
contains the subsets of R that we need to consider for the
minimum weight set cover problem. We start by putting all
singleton subsets of R in W. For example, for the rules in
Table 2,

W = {{a}:4, {b}:3, {c}:3, {d}:3, {e}:4}.

Note that the value after ":" is the weight of the
corresponding set. For singleton sets, the weight is the same
as the complexity of the rule. Now we compare each element in
W with the other elements in W. If the two elements under
consideration can be merged, then we merge them and add the
new merged rule to the working set. Now,

W = {{a}:4, {b}:3, {c}:3, {d}:3, {e}:4,
{bc}:4, {cd}:4}.

We continue this process of merging and adding new rules till
no more rules can be added to W. In our example, {bc} can be
merged with {a} and {cd} with {e}. After this no more merging
is possible. So the final set is

W = {{a}:4, {b}:3, {c}:3, {d}:3, {e}:4,
{bc}:4, {cd}:4, {a{bc}}:5, {{cd}e}:5}.
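The construction of the working set can be sketched as follows. The representation is an assumption made for illustration: a rule is a triple of value sets (shost, dhost, dport), and each working-set entry is keyed by the set of original rule labels it covers, which is a simplification that would conflate two different merges covering the same labels. The rule for c is the one implied by the merges described in the text.

```python
import itertools

FIELDS = ("shost", "dhost", "dport")


def rule(shost, dhost, dport):
    return (frozenset(shost), frozenset(dhost), frozenset(dport))


def merge_rules(r, s):
    """Union two rules that differ on exactly one field, else return None."""
    differing = [i for i in range(len(FIELDS)) if r[i] != s[i]]
    if len(differing) != 1:
        return None
    i = differing[0]
    return r[:i] + (r[i] | s[i],) + r[i + 1:]


def weight(r):
    """Complexity of a rule: the number of data items it contains."""
    return sum(len(values) for values in r)


def generate_working_set(flattened):
    """Map each useful subset of rule labels to the weight of its merged rule."""
    work = {frozenset([name]): r for name, r in flattened.items()}
    changed = True
    while changed:                        # repeat until no new merge is found
        changed = False
        pairs = itertools.combinations(list(work.items()), 2)
        for (cov_a, rule_a), (cov_b, rule_b) in pairs:
            merged = merge_rules(rule_a, rule_b)
            key = cov_a | cov_b
            if merged is not None and key not in work:
                work[key] = merged
                changed = True
    return {key: weight(r) for key, r in work.items()}


TABLE_2 = {
    "a": rule({"192.168.1.1", "192.168.1.10"}, {"192.168.2.1"}, {80}),
    "b": rule({"192.168.1.1"}, {"192.168.2.1"}, {25}),
    "c": rule({"192.168.1.10"}, {"192.168.2.1"}, {25}),
    "d": rule({"192.168.1.10"}, {"192.168.2.2"}, {25}),
    "e": rule({"192.168.1.10"}, {"192.168.2.1", "192.168.2.2"}, {22}),
}
W = generate_working_set(TABLE_2)
```

For the rules of Table 2 this yields the nine entries listed above, with {a,b,c} and {c,d,e} each having weight 5.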

Merge Graphs 

Comparing each element in W with all the other elements is
computationally expensive. We overcome this problem by
generating a graph, which we call a merge graph, that allows
us to avoid many comparisons. A merge graph is an acyclic
graph which initially contains nodes corresponding to the
flattened rules and no edges. As we merge rules, we add nodes
corresponding to the merged rules and edges from the new node
to the constituent rule nodes. For example, after we add {bc}
to W, we can represent W as a merge

procedure GenerateWorkingSet(R) {      /* R = {r1, r2, ..., rn} */
    W = ∅
    for i = 1 to |R| do                /* add all flattened rules to the working set */
        W = W ∪ {ri}
    end
    for i = 1 to |W| do
        for j = 1 to |W| do
            S = MergeRules(wi, wj)     /* wi, wj ∈ W */
            if S ≠ ∅
                W = W ∪ S
            endif
        end
    end
    return W
}
graph as shown in Figure 10. The merge graph may contain
disconnected subgraphs. For each node in the


        {bc}
{a} {b} {c} {d} {e}

Figure 10: Merge graph after adding {bc}.

{a{bc}}         {{cd}e}
    {bc}     {cd}
{a} {b} {c} {d} {e}

Figure 11: Final merge graph.


graph, we compare it with only the nodes in other dis- 
connected components. This way we avoid redundant 
comparisons. In Figure 10, we compare the rule {bc} 
with {a}, {d}, and {e}. This avoids redundant com- 
parisons with {b} and {c}. For large rule sets with 
multiple rules that merge, this optimization proves 
very useful. Figure 11 shows the final merge graph for 
our example. 


Solving Minimum Weight Set Cover 


Now that we have the working set we can find 
the minimum weight set cover to get a policy with low 
complexity. Conceptually we can think of each ele- 
ment of W as a set containing the rules that have been 
merged to form that element. For the example in the 
previous section, 




/* add the new merged rule to the working set */ 


Figure 9: Algorithm for Constructing Working Set. 
W = {{a}:4, {b}:3, {c}:3, {d}:3, {e}:4, {b,c}:4,
{c,d}:4, {a,b,c}:5, {c,d,e}:5}.

Our target is to find a set cover
$C = \{c_1, c_2, \ldots, c_k\}$, i.e.,
$C \subseteq W \wedge \bigcup_{i=1}^{k} c_i = R$, such that
$\sum_{i=1}^{k} \mathrm{weight}(c_i)$ is minimum.

Figure 12 shows our algorithm, based on greedy heuristics, to
solve the problem. We use a set, A, to keep track of the rules
that are covered as the set cover C is built. For each set
$w_i$ in W, we define

$\mathrm{cost}(w_i) = \dfrac{\mathrm{weight}(w_i)}{|w_i \setminus A|}$

where cost($w_i$) represents the cost incurred (in terms of
weight) per new rule that will be covered by including the set
$w_i$ in the set cover. In each iteration, we pick a set $w_i$
that has the lowest cost (step 6 in our algorithm). In our
example, initially the cost is 4/1 = 4 for {a}, 4/2 = 2 for
{b,c}, 5/3 = 1.67 for {a,b,c}, and so on. In case of a tie, we
pick the set with higher cardinality. This algorithm returns
C = {{a,b,c}, {c,d,e}} as the minimum weight set cover. We
know that these sets correspond to the merged rules {a{bc}}
and {{cd}e}. So we can represent the rules in the example as
Policy 4, as shown in the previous section.
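A minimal sketch of the greedy selection follows, assuming the working set is given as a mapping from sets of rule labels to weights. The function name greedy_set_cover is hypothetical, and the final tie-break on sorted labels is an addition of this sketch to make the result deterministic; the text only specifies the preference for higher cardinality.

```python
from fractions import Fraction


def greedy_set_cover(rules, working_set):
    """Repeatedly pick the set with the lowest weight per newly covered rule."""
    covered, cover = set(), []
    while covered != rules:
        # Only sets that cover at least one new rule are candidates.
        candidates = [w for w in working_set if w - covered]
        best = min(
            candidates,
            key=lambda w: (
                Fraction(working_set[w], len(w - covered)),  # cost(w)
                -len(w),                                     # tie: larger set
                sorted(w),                                   # tie: stable order
            ),
        )
        cover.append(best)
        covered |= best
    return cover


W = {
    frozenset("a"): 4, frozenset("b"): 3, frozenset("c"): 3,
    frozenset("d"): 3, frozenset("e"): 4,
    frozenset("bc"): 4, frozenset("cd"): 4,
    frozenset("abc"): 5, frozenset("cde"): 5,
}
C = greedy_set_cover(set("abcde"), W)
total_weight = sum(W[c] for c in C)
```

The loop terminates because W contains every singleton, so some candidate always covers a new rule. On the example it selects {a,b,c} and then {c,d,e}, for a total weight of 10.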


Related Work 


There are many tools (e.g., Firmato [2], Shore- 
wall [4], Firestarter [3]) that are available for generat- 
ing low-level firewall rules from high-level policy. 
Our technique can be used in conjunction with these 
tools to help refactoring. These tools can be used to 
generate firewall rules from scratch. If our technique 
is combined with these tools then we can use these 
tools to make changes to existing low level rules. 


Fang [7] and ITVal [9] are tools that provide a querying
facility. But they put the onus on the system administrator to
figure out what queries to perform. Lumeta Firewall Analyzer
[12] solves this problem by querying the system for all
packets that can be accepted. The problem with this approach
is that it results in a large amount of data being presented
to the user. In contrast, we try to present the result of our

1. procedure MinimumWeightSetCover(R, W) { 
analysis in a compact fashion to the administrator. Moreover,
it is easy to provide querying capability using our technique.

Yuan, et al. [13], Gouda, et al. [6], and Al-Shaer, et al.
[1] have looked at the problem of identifying configuration
errors in the firewall rules. The problem with these
approaches is that the administrator has to decide whether the
alerts generated by these tools are due to rules that were put
in intentionally or unintentionally. Our technique can help in
solving this problem by providing a high level view of the
security policy.


Marmorstein, et al. [8] generate policy by grouping similar
hosts. Our work is more general in the sense that we can group
together arbitrary things to generate a more compact
representation. Golnabi, et al. [5] have looked at the problem
of generating high level policy. But their approach is based
on data mining of firewall logs, while we try to extract the
policy from the rules themselves.


Conclusions 


In this paper, we presented a new technique for extracting
high-level security policy from low-level rules. Unlike
previous techniques, our technique generates policy which is
compact. This will help system administrators to understand
the existing low-level rule sets and encourage them to
refactor the low-level rules instead of making small changes
to the rule set when requirements change. This will make the
rule sets more manageable and will likely reduce errors in the
configuration of firewalls. We also presented a way of
comparing whether one policy representation is better than
another. In our preliminary experiments, we obtained 50
flattened rules with 197 data items after the priority
elimination phase from a 65 rule firewall. The final inferred
policy, on the other hand, had just 21 rules with 129 data
items. These results indicate that we can generate high level
policy which is easier to understand and manage.
ABSTRACT


Firewall policies can be extremely complex and difficult to maintain, especially on networks
with more than a few hundred machines. The difficulty of configuring a firewall properly often
leads to serious errors in the firewall configuration or discourages system administrators from
implementing restrictive policies.

In previous research, we developed a technique for modeling firewall policies using 
Multiway Decision Diagrams and performing logical queries against a decision diagram model. 
Using the query logic, the system administrator can detect errors in the policy and gain a deeper 
understanding of the behavior of the firewall. The technique is extremely efficient and can process 
policies with thousands of rules in just a few seconds. While queries are a significant improvement 
over manual inspection of the policy for detecting that errors exist, they provide only limited 
assistance in repairing a broken policy. In this paper we present two extensions to our work, 
examples and history, which enable the administrator to more easily repair a policy which contains 
errors. 


An example is a representative packet which illustrates that the firewall complies with or
(more importantly) deviates from its expected behavior. History records the specific rules involved
in the deviation. Examples and history provide guidance in finding and fixing faults in a firewall
rule set. These contributions can also be used with the equivalence class analysis to reduce the
burden of designing a complicated set of assertions.


Introduction 


The administrator who maintains a restrictive 
firewall policy on a large network must spend a con- 
siderable amount of time and effort updating and test- 
ing the filtering rules. Requests for new services, 
changes in the physical topology of the network, and 
the emergence of new security threats require contin- 
ual modification of the policy. As the policy changes 
and grows, it can be difficult for the administrator to 
avoid introducing errors into the rule set. Repairing 
these errors is often very challenging. Firewall policy 
errors are subtle and difficult to detect. Even when the 
existence of an error is obvious, discovering the source 
of the problem and correcting it can be tedious and 
expensive. 













| # | Target | Source         | Destination    | Interface   | Flags |
| 1 | DROP   | 192.168.1.0/24 | anywhere       | [illegible] |       |
| 2 | DROP   | 192.168.3.0/22 | 192.168.2.0/24 | any         |       |
| 4 | DROP   | anywhere       | 192.168.2.0/24 | any         |       |
| 5 | ACCEPT | 192.168.1.0/24 | anywhere       | any         |       |


In previous work, we introduced techniques for 
quickly and easily validating a firewall policy using 
logical queries against a Multiway-Decision Diagram 
model of the firewall policy. The MDD approach is 
very efficient (complex queries involving rule sets 
with hundreds of rules usually take only a few sec- 
onds) and allows for very flexible identification of 
errors. Like most existing approaches, the MDD query 
technique addresses the issue of testing the firewall for 
errors, but leaves the problem of repairing the policy 
entirely up to the administrator. Because tracing through 
dozens or perhaps hundreds of correct rules to find the 
two or three critical inconsistencies can take hours or 
even days, this is a significant burden for administrators 
of a large network. 


Figure 1: A rule set which incorrectly blocks access from the 192.168.1.0/24 subnet. Rule 1 of the policy ensures
that traffic from the 192.168.1.0/24 subnet arrives on the correct interface. Rule 2 blocks traffic from the
insecure wireless network to the server subnet. Rule 3 grants HTTP access to the web server to appropriate hosts.
Rule 4 prevents external access to other servers. Rule 5 allows hosts on the trusted subnet to transmit packets
that have not been blocked by some previous rule.
In this work, we present two novel techniques that enable
"directed repair" of the firewall policy. Using these
techniques, the system administrator not only can identify the
existence of an error in the policy, but can trace it back to
its root causes without an expensive manual inspection of the
rule set. The first major contribution is a technique for
providing example packets that illustrate that the firewall
violates a set of security requirements. The second
contribution creates a history map which identifies the
particular firewall rules which cause the firewall to deviate
from its desired behavior.


While these techniques do not fully automate the process of
repairing the firewall, they do provide the system
administrator with information that makes repair much easier
than a simple verification of the policy. We have implemented
both techniques in ITVal [6], a firewall analysis and repair
tool for iptables firewalls. Although we use the Linux
iptables firewall for the examples in this paper, it is
possible to adapt these techniques to work with other
platforms such as PIX and Checkpoint firewalls. There are also
several fairly effective scripts for converting ipfw and
ipchains firewalls to iptables syntax [11], which can be used
to adapt such policies to a format compatible with ITVal.


The remainder of this paper is structured as follows. The
next section describes the difficulties which a system
administrator encounters in repairing a firewall. Then we
discuss partially automated repair of the policy. The next two
sections detail our techniques for generating examples and
history, respectively. Sections on implementation using MDDs
and a description of how this work can be combined with our
previous work on equivalence class analysis of a firewall
policy follow. Finally, we discuss related work and make a few
concluding remarks.


Firewall Policy Errors 


The techniques discussed in our previous work 
allow a system administrator to perform basic logical 
queries against a firewall policy using a simple speci- 
fication language. For instance, to ask which hosts can 
access a web server, host 192.168.2.4, the administra- 
tor can use the query QUERY SADDY TO 192.168.2.4 
AND FORWARD ACCEPTED; which will list the source 
addresses of any host that can access the web server 
without being blocked by the firewall. Inspecting this 
list of addresses may allow the user to detect an error 
in the configuration of the firewall. If the address of a 
malicious host appears in the list, for instance, it is 
clear that there is a problem with the firewall policy. 


Queries allow the system administrator to identify many
serious errors in the configuration of a firewall, but provide
only a limited amount of information about each error. For
instance, if the system administrator uses the query "Which
hosts can connect to the mail server?" to determine whether
the firewall blocks external hosts, the analysis engine will
list those hosts that have unwanted access to the server, but
will not provide any additional information that can be used
to understand why the firewall failed to prevent access. It
may be that the error only occurs when connections are made on
a particular network interface or by a particular network
protocol. Access to this information could greatly assist the
system administrator in repairing the policy, but traditional
testing tools do not provide these helpful clues. Another way
to think about this is to say that a query engine discovers an
error (the ultimate consequence of a problem in the policy),
but not the fault (the mistake in the rules that causes the
error).


Sometimes, helpful information can be obtained 
using additional queries or by refining the query to pro- 
vide more information. Often, however, the process of 
developing a sufficiently detailed set of queries requires 
almost as much effort as manual repair of the policy. 


This means that query tools are usually limited to 
detecting whether an error exists and have only lim- 
ited utility in guiding repair of the policy. To repair the 
policy by hand, the system administrator must care- 
fully consider each filtering rule to determine whether 
it is relevant to the error and, if so, whether it is cor- 
rect. Since most of the rules will usually be either 
irrelevant or valid, manual repair is a very inefficient 
and time consuming process, especially when an error 
has many potential causes. 


[Figure 2 diagram; labels: Outside World, Web Server (192.168.2.4), Workstations, Insecure Network]


Figure 2: A typical firewall, which protects hosts on 
two subnets against intrusions from a third, un- 
trusted network and the outside world. One of the 
protected subnets contains a web server, host 
192.168.2.4, to which remote connections are al- 
lowed. 


Figure 1 shows how difficult it can be to trace a 
firewall error to its source. This rule set protects work- 
stations on the subnet 192.168.1.0/24 and servers on 
the 192.168.2.0/24 subnet against attacks from the 
outside world and an insecure wireless network on the 
192.168.3.0/24 subnet. An illustration of the network 
is given in Figure 2. 

The system administrator wants to allow access 
to the web server, host 192.168.2.4, from any system 
in the outside world except those on the unsecured 
wireless network. All other external traffic to the web 
server should be blocked. 
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It is fairly easy to determine that the rule set fails 
to enforce these requirements. If the system administra- 
tor opens up a web browser and tries to connect to the 
web server from a host on the trusted network, the fire- 
wall will refuse to allow the connection. Discovering 
the cause of this error is more challenging, since nearly 
every rule of the policy plays some role in the filtering 
decision. An error in rule 4, which drops traffic to the 
protected subnet, could be the source of the error. A 
typo in rule 3, which overrides rule 4 to allow web traffic 
to enter the network, might be another cause. 
Rule 1, an anti-spoofing rule which blocks traffic from 
the “wrong” interface, might also be to blame. 


As it turns out, the fault that produces the ob- 
served error is in rule 2. An incorrect subnet mask in 
rule 2 causes the firewall to block traffic from the pro- 
tected network as well as the untrusted net. Manual 
analysis of the policy requires a careful and tedious 
inspection of every rule in the policy to identify this 
fault. For the five-rule policy shown here, this inspection 
might not take too long. However, a policy with 
more than a few dozen rules would be much more dif- 
ficult to analyze. Partially automating the repair process 
in a way that narrows down the potential sources of the 
error to just one or two rules could save the administra- 
tor a significant amount of effort. 


Partially Automated Firewall Repair 


Unfortunately, it is impossible to fully automate 
repair of a generic firewall policy because incorrect 
behavior on one network may be expected behavior on 
another. For instance, on one network it may be desir- 
able to allow SMTP traffic to reach certain hosts, such 
as the mail servers. On another network, however, a 
policy that permits SMTP traffic may allow spam-bots to 
compromise important systems. Without input from 
the user, a repair algorithm cannot distinguish between 
these two cases. 


While a fully automatic strategy for firewall repair 
is impossible, partial automation is possible. Gouda, 
Liu, et al. have done significant work on repair of struc- 
tural errors in the firewall policy [4]. Their technique 
uses transformation of decision diagrams to produce an 
improved rule set in which problems such as shadowed 
or duplicate rules have been eliminated. This strategy 
does not require any assistance from the user. Unfortu- 
nately, these techniques do not address repair of logical 
errors such as typos or out-of-order rules. 


Another approach is to allow the user to make 
the final decision about how to repair the policy, but 
automate the process of finding the faults responsible 
for the error. By providing the system administrator 
with sufficient information about the possible causes 
of the error, we can guide her toward a few possible 
solutions, from which she can choose the one best 
suited to her network. This “directed repair” of the 
policy alleviates much of the tedious work required to 
find faults and fix the policy. 

Directed Repair 


In previous work [7, 8, 9], we explored ways to 
detect errors in a firewall configuration using logical 
queries and an equivalence class decomposition of the 
network. In this paper, we describe two extensions of 
this work that enable directed repair of the firewall 
policy. One technique generates relevant counterexamples 
from which the system administrator can obtain 
detailed information about security failures in the policy. 
The second technique provides an extensive “history 
analysis” that identifies potential sources of the 
error and lists rules which should be considered for 
modification. The history analysis can also be used 
with the equivalence technique described in [9], which 
addresses the need for extensive preparation of logical 
queries. We implement both of these techniques as 
extensions to ITVal [6], an open-source firewall testing 
tool developed as part of our previous work. 


FROM <address_range>
    matches all packets with source address in address_range
TO <address_range>
    matches all packets with destination address in address_range
ON <port_range>
    matches all packets with source port in port_range
FOR <port_range>
    matches all packets with destination port in port_range
IN <s>
    matches all packets associated with connections in state s
WITH <flag_set>
    matches all packets with the TCP flags in flag_set enabled
ACCEPTED <chain>
    matches all packets accepted by built-in chain chain
DROPPED <chain>
    matches all packets rejected by built-in chain chain
INFACE <iface>
    matches all packets received by network interface iface
OUTFACE <iface>
    matches all packets transmitted on network interface iface

Figure 3: ITVal primitives. 


To use these techniques, the user specifies the 
desired behavior of the firewall using logical asser- 
tions. The syntax for assertions is derived from the 
query language explained in [8]. The right and left 
conditions of the assertion are built from a set of sim- 
ple primitives such as those in Figure 3, which can be 
combined using the logical operators AND, OR, and 
NOT to create complex conditions describing sets of 
packets whose treatment by the firewall requires anal- 
ysis. For example, we can describe all accepted SSH 
packets from subnet 192.168.1.0/24 on interface eth0 
using the condition 

FOR TCP 22 AND 
FROM 192.168.1.* AND 
INFACE eth0 AND 
(ACCEPTED FORWARD OR 
ACCEPTED INPUT); 

The user can construct two types of assertions from 
these conditions. Equality assertions have the form: 


ASSERT <A> IS <B> 


where A and B are conditions. Containment assertions 
have the form 


ASSERT <A> SUBSET OF <B> 


Equality assertions specify that those packets which 
match condition A are exactly those that match condi- 
tion B. Containment assertions specify that the set of 
packets that satisfy condition A is (non-strictly) con- 
tained in the set of packets that satisfy condition B. 
Using these assertions, the user can describe important 
high-level security invariants which the policy should 
always satisfy. 


For instance, the containment assertion 


ASSERT FROM 192.168.3.* 
SUBSET OF DROPPED FORWARD; 


specifies that any packet from subnet 192.168.3.0/24 
is dropped. The equality assertion 

ASSERT FROM 192.168.2.* 
IS (FOR TCP 80 
AND ACCEPTED FORWARD); 

can be used to check that only HTTP packets are 
allowed to enter the network from the 192.168.2.0/24 
subnet and that no other web connections are allowed 
by the firewall. We call the set of packets that match a 
condition its match set and the set of packets that cause 
an assertion to fail the assertion’s fail set. Assertions 
provide many advantages over simple queries. While 
queries allow the user to obtain a significant amount 
of information about the policy, a query does not pro- 
vide the analysis engine with any description of the 
expected behavior of the firewall. Therefore, using 
assertions enables the engine to provide more useful 
and relevant output. 
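
The match set and fail set can be pictured with ordinary Python sets. The sketch below is illustrative only: it assumes a tiny hypothetical packet universe of (source, destination port) pairs and an invented set of dropped packets, and the helper names are ours rather than part of ITVal, which represents these sets as decision diagrams.

```python
from itertools import product

# Toy universe: every combination of a few sources and destination ports.
# These values are illustrative only.
SOURCES = ["192.168.2.7", "192.168.3.9", "10.0.0.1"]
PORTS = [22, 80]
UNIVERSE = set(product(SOURCES, PORTS))

# An invented filtering decision: packets the FORWARD chain drops.
DROPPED_FORWARD = {
    p for p in UNIVERSE if p[0].startswith("192.168.3.") or p[1] == 22
}


def match_set(predicate):
    """Return the packets in the universe that satisfy a condition."""
    return {p for p in UNIVERSE if predicate(p)}


def fail_set_subset(a, b):
    """Packets that break ASSERT a SUBSET OF b: in a but not in b."""
    return a - b


def fail_set_is(a, b):
    """Packets that break ASSERT a IS b: in exactly one of a and b."""
    return a ^ b


from_untrusted = match_set(lambda p: p[0].startswith("192.168.3."))
```

Here `fail_set_subset(from_untrusted, DROPPED_FORWARD)` is empty, so the containment assertion holds, while the corresponding equality assertion fails because SSH packets from other sources are also dropped.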


Counterexamples and Witnesses 


One useful advantage of assertion analysis is that 
it allows generation of relevant counterexamples. These 
counterexamples provide a context for the error which 
can often help the administrator discover why a failure 
has occurred. 

Rule set for Figure 4 (as far as it can be recovered):

# | Target | Source         | Destination    | Interface | Flags
1 | ACCEPT | anywhere       | 192.168.1.0/24 | eth0      | dpt:tcp 22
2 | ACCEPT | anywhere       | 131.106.3.253  | eth1      |
3 | DROP   | 63.118.7.16    | anywhere       | eth0      |
4 | DROP   | 192.168.2.0/24 | anywhere       | any       |


The example policy in Figure 4 isolates an un- 
trusted research network 192.168.2.0/24 from the out- 
side world. SSH traffic from the untrusted network to 
hosts on subnet 192.168.1.0/24 is permitted, but all 
other traffic from the network is denied. The 
192.168.1.0/24 subnet contains several world-accessible web 
servers to which the policy grants access. However, 
the rule set blocks connections from 63.118.7.16, a 
malicious host. Trusted hosts are allowed to make 
connections to the web servers and an external server, 
host 131.106.3.253, but cannot make any other con- 
nections. 


To test whether the untrusted hosts are suffi- 
ciently restricted by the firewall, the administrator 
uses the assertion 

ASSERT (FROM 192.168.2.* 
AND NOT FOR TCP 22) 
SUBSET OF DROPPED FORWARD; 


which specifies that only SSH traffic is accepted from 
hosts on the untrusted network. Due to an error in the 
ordering of rules 2 and 4, the assertion will fail. This 
subtle error could be very difficult to detect in a 
lengthier policy in which the rules were much further 
apart. Using ITVal, the administrator can easily dis- 
cover that the assertion fails. Knowing that the asser- 
tion does not hold is an important first step, but does 
not give much information about the cause of the 
error. To give the user more information about the 
source of the error, we generate a counterexample — a 
packet that demonstrates the falsity of the assertion. 
Figure 5 shows the generation of one possible coun- 
terexample. The user specifies that an example should 
be generated by inserting the keyword EXAMPLE at the 
beginning of the assertion. 


Examination of the counterexample gives the 
system administrator important information about the 
assertion failure. One significant clue is that the exam- 
ple packet arrived on interface eth1. Since only rule 2 
mentions eth1, this fact draws the administrator’s im- 
mediate attention to the rule ordering error, which can 
now be corrected by moving rule 2 to the correct loca- 
tion in the policy. 

Figure 4: An incorrect forwarding chain which allows non-SSH traffic from hosts on the 192.168.2.0/24 network. 

Sometimes it is desirable to obtain an example 
even when an assertion succeeds. We call such an 
example a “witness,” since it illustrates the assertion. 
Witnesses are less powerful than counterexamples in 
that the existence of a counterexample demonstrates 
conclusively that an assertion is false while the exis- 
tence of a witness only demonstrates that it is possible 
to satisfy the assertion. Nevertheless, witnesses can be 
very useful for convincing yourself (or others) that an 
assertion really holds. They can also be useful for 
debugging certain kinds of problems that can best be 
tested with an assertion that you expect to fail. 


ASSERT EXAMPLE (FROM 192.168.2.* 
AND NOT FOR TCP 22) 
SUBSET OF DROPPED FORWARD; 
Assertion failed. 
Counterexample: 
TCP packet from 192.168.2.1:6362[eth1] 
to 131.106.3.253:25[eth1] 
in state NEW with flags[]. 


Figure 5: Counterexample for the example assertion. 


Suppose that you wanted to ensure that the fire- 
wall policy does not block all SMTP connections. To 
test whether your firewall correctly implements this 
policy, you might use the assertion: 


ASSERT EXAMPLE FOR TCP 25 
SUBSET OF DROPPED FORWARD; 


If the policy is correct, the assertion should fail, since 
the assertion implies that all SMTP packets are dropped. 
In this situation, a witness can often provide useful 
information that allows you to discover why the asser- 
tion succeeds. Since using assertions in this manner is 
very counter-intuitive, we provide the user with NOT 
SUBSET OF and IS NOT operators that can be used in 
place of the SUBSET OF and IS keywords. This allows 
the user to avoid using “backward assertions” like the 
one above. Using these operators, the user can test 
that not all SMTP packets are dropped as follows: 

ASSERT EXAMPLE FOR TCP 25 
NOT SUBSET OF DROPPED FORWARD; 


This new assertion will hold in exactly the cases in 
which the old assertion would fail and vice versa, but 
is much more intuitive to use. 


Rule History 


Witnesses and counterexamples provide the sys- 
tem administrator with very detailed information about 
the causes of an assertion failure. Nevertheless, it can 
be difficult to trace an error to the fault that causes it 
even when a good counterexample is available. We can 
obtain more precise information about the particular 
rules that create an error by constructing a “history 
map” during generation of the rule set MDD. The his- 
tory map matches each packet to the set of rules that 
potentially accept or drop it. 


Using the history map, we can associate packets 
in an assertion’s fail set with a small number of filter- 
ing rules — usually much smaller than the number of 
rules in the entire policy. This permits the administrator 
to narrow his inspection of the policy to just a few 
critical areas. Since the set of rules to examine includes 
every rule that matches a packet in the assertion’s fail 
set, it is possible that we may list some correct rules as 
well as the incorrect ones. In many cases, however, the 
history map will enable the system administrator to 
ignore most of the rules that are not related to the problem. 


Implementation 


To evaluate an assertion, we represent each con- 
dition using a Multiway Decision Diagram (MDD), a 
data structure which is suitable for representing and 
manipulating large sets of packets. An MDD is a 
directed acyclic graph in which the nodes are orga- 
nized into levels and all arcs from a node at a given 
level point to nodes at the level below. 


[Figure 6 diagram; levels shown: Source 1-4, Dest 1-4, Protocol, Dport, and the terminal node “Matches”.]

Figure 6: MDD representing FROM 192.168.1.*. 


Each level of the MDD corresponds to an at- 
tribute such as protocol, connection state, or destina- 
tion port. For instance, level K, the top level of the 
graph, represents the first source octet of a packet. The 
bottom level of the MDD is a special terminal level 
which indicates whether or not a packet belongs to the 
match set. For space reasons, our figures show only 
some of the levels of each MDD. We also use an asterisk 
as a wildcard character to represent “all arcs not 
explicitly listed in this node” when many of the arcs 
leading from a node point to the same child. 


To construct an MDD representation of a primi- 
tive such as FROM <address> or FOR <port>, we use 
nodes with all arcs pointing to the same child to mask 
out all but the relevant levels of the MDD. To repre- 
sent FROM 192.168.1.*, we start at the top of the MDD 
and work down, inserting a node with just one arc 
labeled “192” at the top level (since the top level cor- 
responds to the first octet of the source address). This 
arc connects to a node at the next level down which 
has a single arc labeled “168” which points to a node 
with an arc labeled “1” in the next level. The “1” arc 
connects to a node representing the fourth source octet 
of the condition. All the remaining nodes in the graph 
are labeled with the wildcard character, since the other 
criteria are not relevant to the condition. This process 
is illustrated by Figure 6, which shows a simple condition 
MDD. 


To test whether a particular packet is in the 
match set of a condition, we simply descend the MDD 
from its root to a terminal node using properties of the 
packet to guide the descent. If we reach the “matches” 
node, the packet is in the set. Otherwise, it is not. In 
Figure 6, a packet from 192.168.1.1 to 64.130.15.7 on 
TCP port 25 matches the condition, but a packet from 
192.168.4.1 does not, since there is no arc for “4” leaving 
the node for source octet three. 
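
The descent described above can be sketched with nested dictionaries standing in for MDD nodes. This is a simplified illustration, not ITVal's data structure: it assumes a reduced level order (four source octets, protocol, destination port), uses the string "*" as the wildcard arc, and the function names are hypothetical.

```python
WILDCARD = "*"   # stands for "all arcs not explicitly listed in this node"
MATCHES = True   # terminal node reached by packets in the match set


def build_from_192_168_1():
    """Build an MDD for FROM 192.168.1.* over the reduced level order
    (src1, src2, src3, src4, protocol, dport)."""
    node = MATCHES
    # Levels below the third source octet are irrelevant to the
    # condition, so each gets a single wildcard arc.
    for _ in ("dport", "protocol", "src4"):
        node = {WILDCARD: node}
    # The first three source octets each get exactly one labeled arc.
    for octet in (1, 168, 192):
        node = {octet: node}
    return node


def matches(mdd, packet):
    """Descend from the root, one packet attribute per level."""
    node = mdd
    for value in packet:
        if value in node:
            node = node[value]
        elif WILDCARD in node:
            node = node[WILDCARD]
        else:
            return False  # no arc for this value: not in the match set
    return node is MATCHES
```

With this encoding, a packet is a tuple such as `(192, 168, 1, 1, "tcp", 25)`; a packet whose third source octet is 4 falls off the graph at the third level, as in the example above.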


The primitives ACCEPTED <chain> and DROPPED 
<chain> require special treatment. To generate MDDs 
for these conditions, we construct an MDD representa- 
tion of each firewall chain. We then remove all the 
paths except those which point to the correct terminal 
node (either ACCEPTED or DROPPED) using a pro- 
jection operation. 


[Figure 7 diagram; levels shown: Source 1-4, Dest 1-4, Protocol, Dport, Interface.]

Figure 7: MDD for the FORWARD chain. 


Figure 7 shows part of the MDD for the chain in 
Figure 1. To create an MDD representing ACCEPTED 
FORWARD, we copy those paths of the chain MDD 
which lead to the ACCEPTED node into the condition 
MDD and ignore paths leading to the DROPPED 
node. The resulting MDD is given in Figure 8. 


Complex conditions containing the AND, OR, 
and NOT operators can be represented by using MDD 
intersection, union, and complement operators to combine 
the primitive MDDs. An example MDD for a more complex 
condition is given in Figure 9. Union and intersection 
can be performed very efficiently using MDDs. Using 
operation caches, we can obtain a guarantee that each 
pair of nodes in the graph is visited only once during 
these operations. Since the number of nodes in the 
graph is usually much smaller than the number of 
packets represented by each condition, we can com- 
plete these operations very rapidly. The complement 
operator is also very efficient. It requires a single 
descent of the MDD, which is linear with respect to 
the size of the graph. 
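
As a rough illustration of why operation caches help, the sketch below stores each MDD node as a tuple of children (one per value of that level's domain, with 0 and 1 as terminals) and memoizes the set operations with `functools.lru_cache`. The representation, the two-level example, and all names are assumptions made for this sketch; a real MDD library shares nodes explicitly and is considerably more refined.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def intersect(a, b):
    """Intersection of two MDDs that share the same level structure."""
    if isinstance(a, int):  # terminal level: 1 = in the set, 0 = not
        return a & b
    return tuple(intersect(x, y) for x, y in zip(a, b))


@lru_cache(maxsize=None)
def union(a, b):
    """Union of two MDDs that share the same level structure."""
    if isinstance(a, int):
        return a | b
    return tuple(union(x, y) for x, y in zip(a, b))


@lru_cache(maxsize=None)
def complement(a):
    """Flip every terminal in a single descent of the graph."""
    if isinstance(a, int):
        return 1 - a
    return tuple(complement(x) for x in a)


def count(a):
    """Number of packets represented by an MDD."""
    if isinstance(a, int):
        return a
    return sum(count(x) for x in a)


# Two levels: a source class (3 values) above a destination port (2 values).
FROM_CLASS_0 = ((1, 1), (0, 0), (0, 0))   # source class 0, any port
FOR_PORT_1 = ((0, 1), (0, 1), (0, 1))     # any source, port 1
```

When `intersect(FROM_CLASS_0, FOR_PORT_1)` is evaluated, the pair of children `((0, 0), (0, 1))` occurs twice but is computed only once; the second occurrence is answered from the cache.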


[Figure diagram; levels shown: Source 1-4, Dest 1-4, Protocol.]

[Figure 9 diagram; levels shown: Source 4, Dest 1-4, Protocol, Dport, and the terminal node “Matches”.]

Figure 9: MDD representing FOR TCP 25 AND (FROM 
192.168.1.* OR FROM 192.168.2.*). 





To determine whether a containment assertion 
holds, we examine the set of packets that match condi- 
tion A, but do not match condition B. If the assertion 
fails, this set will be non-empty, as illustrated by Fig- 
ure 10. 


The pseudocode in Figure 11 describes this pro- 
cess in detail. First, we construct MDDs representing 
the packets that match condition A and condition B, 
respectively, in steps 1 and 2. In step 3, we use an 
MDD complement operation to find the set of packets 
which do not match condition B, the right-hand side of 
the assertion. We intersect the MDD returned by the 
complement operation with the MDD representing 
condition A, the left-hand side of the assertion, in step 
4. This creates an MDD representing the fail set of the 
assertion. In steps 5 through 8, we test whether the set 
is empty and return an appropriate value. 


To test an equality assertion, we use the algo- 
rithm given in Figure 13, which is similar to the 
algorithm for testing a containment assertion. Steps 1 
through 7 create an MDD representing the fail set. If 
the fail set is non-empty, we have the situation illus- 
trated by Figure 12 and the assertion fails. If it is 
empty, the assertion holds. 


Figure 10: Fail set for the SUBSET OF operator. 


bool testSubsetAssertion(cond A, cond B): 
[1] mddA = condition_to_MDD(A); 
[2] mddB = condition_to_MDD(B); 
[3] notB = MDD_complement(mddB); 
[4] result = MDD_intersect(mddA, notB); 
[5] if notEmpty(result) then: 
[6]     return ASSERTION_FAILED; 
[7] else: 
[8]     return ASSERTION_HELD; 


Figure 11: Checking a containment assertion. 




Figure 12: Fail set for the IS operator. 


These techniques allow us to determine whether 
or not a firewall policy satisfies a set of assertions. 
These operations form a basis for performing more 
advanced calculations, such as example generation 
and history analysis. 


Implementing Examples 


To generate the counterexample for an assertion, 
we change the algorithms in Figure 11 and Figure 13 
to return an arbitrary element from the fail set. This is 
done by replacing the last four lines of each algorithm 
with those in Figure 14. 

Rule set for Figure 15:

Chain FORWARD (Default DROP)
# | Target | Source         | Destination    | Flags
1 | DROP   | 192.168.2.0/22 | anywhere       |
2 | ACCEPT | anywhere       | 192.168.3.0/24 |
3 | ACCEPT | anywhere       | anywhere       | dpt:tcp 25

The function choose_element(X) picks an arbitrary 
element from the set represented by MDD X. If 
the assertion fails, we choose an element from 
the fail set as the counterexample. If the assertion 
does not fail, we choose an element from the match set of the 
left-hand condition as a witness, since the elements of 
that set must match both conditions. To select an element, 
the choose_element function walks the MDD 
from the root node to the bottom of the graph, arbitrarily 
selecting arcs at each level (in practice, we select 
the first non-zero arc of each node) and storing each 
selected attribute in a “packet” structure which can be 
printed at the end of the traversal. 
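
The traversal performed by choose_element can be sketched as follows. The nested-dictionary MDD, the level names, and the emptiness helper are assumptions made for illustration; only the strategy of taking the first arc that leads to a non-empty child follows the description above.

```python
def is_empty(node):
    """True if no path below this node reaches the accepting terminal."""
    if isinstance(node, bool):
        return not node
    return all(is_empty(child) for child in node.values())


def choose_element(mdd, levels):
    """Walk from the root to the bottom, taking the first arc whose child
    is non-empty, and record one attribute per level in a packet dict."""
    if is_empty(mdd):
        return None
    packet = {}
    node = mdd
    for level in levels:
        for label, child in node.items():
            if not is_empty(child):
                packet[level] = label
                node = child
                break
    return packet


# Hypothetical three-level fail set: source address, destination port,
# and input interface. False marks a path that is not in the set.
LEVELS = ("source", "dport", "iface")
FAIL_SET = {
    "192.168.2.1": {22: {"eth0": False}, 25: {"eth1": True}},
    "192.168.2.2": {25: {"eth1": True}},
}
```

For `FAIL_SET`, the walk skips the port 22 arc (its subtree is empty) and returns the packet from 192.168.2.1 to port 25 on eth1. Recomputing emptiness at every node is wasteful; it keeps the sketch short, whereas a reduced MDD would not store empty subtrees at all.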


bool TestISAssertion(cond A, cond B): 
[1] mddA = condition_to_MDD(A); 
[2] mddB = condition_to_MDD(B); 
[3] notB = MDD_complement(mddB); 
[4] resultA = MDD_intersect(mddA, notB); 
[5] notA = MDD_complement(mddA); 
[6] resultB = MDD_intersect(notA, mddB); 
[7] result = MDD_union(resultA, resultB); 
[8] if notEmpty(result) then: 
[9]     return ASSERTION_FAILED; 
[10] else: 
[11]     return ASSERTION_HELD; 


Figure 13: Checking an equality assertion. 


bool TestSubsetAssertion(cond A, cond B): 
[5] if notEmpty(result) then: 
[6]     return choose_element(result); 
[7] else: 
[8]     return choose_element(mddA); 


Figure 14: Generating an example. 


Implementing History 


In order to build the history map, we construct a 
“history MDD” for each built-in chain of the firewall. 
The history MDD is similar to the MDD for a rule set 
or an assertion, but has two extra levels at the bottom 
of the graph. The top levels of the MDD correspond to 
the levels of a rule set MDD. The extra levels at the 
bottom represent a chain identifier and an index for 
each rule. We reserve index 0 for the default policy of 
a chain and index the remaining rules sequentially 
starting from 1. Construction of the history MDD is 
done concurrently with construction of the MDD repre- 
sentation of each firewall chain. As we insert rules into 
the rule set MDD for the chain, we copy those rules 
into the history MDD, adding nodes to identify the 
chain and rule to the bottom levels. If we encounter a 
rule which matches packets already matched by some 
other rule, we use MDD union to store a mapping to 
both rules. 


Figure 15: Example rule set which protects a network 192.168.4.0/24. 

Suppose we want to test the assertion 

ASSERT HISTORY NOT FROM 192.168.2.* 
AND TO 192.168.4.* 
AND FOR TCP 25 
SUBSET OF ACCEPTED FORWARD; 

against the policy given in Figure 15. The assertion 
verifies that SMTP traffic is allowed to a protected 
network 192.168.4.0/24 unless it originates on an 
untrusted network 192.168.2.0/24, from which all traffic 
is blocked by the firewall. The assertion will fail due 
to a typo in the subnet mask of the source address in 
rule 1. As a consequence of this fault, an SMTP 
packet from 192.168.3.1 to 192.168.4.2 that should be 
accepted will be dropped. 


An example history MDD for the rule set is 
given in Figure 16. To save space, only some levels of 
the MDD are represented in the figure. 


We can use the history MDD to find the rules 
that match this packet by descending the graph. To 
make this descent easier to follow, we have high- 
lighted the path representing the example packet in 
Figure 16. Starting from the root node, we follow the 
arc labeled 192, since the source address of our exam- 
ple packet begins with 192. We next follow the arc 
labeled 168. 


Because there is a typo in the subnet mask of rule 
1, the node we are now examining has arcs for 0, 1, and 
3 that point to the same child node as the arc for 2. We 
follow the arc labeled 3 to a node with all arcs pointing 
to the same child. We continue past the child node. The 
destination address of our example packet begins with 
192, so we follow the arc labeled 192. We then follow 
the arc labeled 168 and the arc labeled with a wildcard, 
which represents all values other than 3. 


This brings us to another node with all arcs 
pointing to the same child, which we continue past. 
We now take the arc labeled TCP, then the arc labeled 
25. This takes us to a node at the chain level. The only 
arc leaving this node is labeled 1, so we know that the 
only rules that affect this packet are in the FORWARD 
chain. Following the arc takes us to a node representing 
rules 1, 3, and the default policy (represented by 
the label 0). Rule 2 is not listed since it matches only 
packets sent to subnet 192.168.3.0/24. 

[Figure 16 diagram; levels shown: Source Address 1-4, Destination Address 1-4, Protocol, Dest. Port, Chain, Rule Index.]


[Figure 17 diagram; levels shown: Source Address, Destination Address, Protocol, Chain, Rule.]

Figure 17: History MDD for an assertion. 








This traversal gives the history map for a single 
packet. To find the rules that match those packets that 
violate the assertion, we intersect the history MDD for 
a chain with an MDD representing the fail set of the 
assertion. The fail set MDD can be computed as for 
counterexample generation, except that the result must 
be extended to include levels for the rule index and 
chain identifier. This can be done by padding the fail 
set MDD with wildcard nodes at the bottom two lev- 
els. This is illustrated by Figure 17, which gives an 
extended fail set MDD for the assertion. 

The top four levels of the MDD correspond to 
the source addresses of packets that match the asser- 
tion. In this case, the assertion matches all packets 
except those from the 192.168.2.0/24 subnet. The next 
four levels represent the destination addresses of pack- 
ets that match the assertion. In the example, only 
packets to subnet 192.168.4.0/24 match. The next lev- 
els represent protocol and destination port. The bottom 
levels represent the chain identifier and rule index of 
the packet. Since we have not yet determined which 
rules are related to the assertion, these are represented 
as wildcard nodes. 

Figure 16: History MDD for a firewall chain. 


[Figure 18 diagram; levels shown: Source Address 1-4, Dest. Address 1-4, Protocol, Dport, Chain, Rule.]

Figure 18: Result MDD after intersection. 





Intersecting the extended fail set MDD with the 
history MDD gives us the result MDD in Figure 
18. ITVal converts this graph into the human-readable 
output given in Figure 19. From this output, it is very 
easy to see that the fault lies in either rule 1 or rule 3. 
Rule 2 is ignored, since it only matches packets that 
do not cause the assertion to fail. In a longer policy, 
other extraneous rules would also be ignored. 


ASSERT HISTORY NOT FROM 192.168.2.* 
AND TO 192.168.4.* AND FOR TCP 25 
SUBSET OF ACCEPTED FORWARD; 
##Assertion failed. 
Critical Rules: 
Firewall 0 Chain 1 Default Policy. 
Firewall 0 Chain 1 Rule 1: 
DROP all -- * * 192.168.2.0/22 0.0.0.0/0 
Firewall 0 Chain 1 Rule 3: 
ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:25 


Figure 19: Human readable history map. 


In this case the fault lies in rule 1. An examina- 
tion of that rule quickly reveals the typo. Our algo- 
rithm also lists rule 3 and the default policy of the 
FORWARD chain. Rule 3 is the rule that should have 
accepted the dropped packets and therefore may help 
the administrator understand the error. The default 
policy can be ignored, since it always matches every 
packet seen by the firewall. Using this enumeration of 
the matching rules, the system administrator can con- 
centrate on the rules directly relevant to the problem. 
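
The effect of the history analysis on this example can be reproduced with a brute-force sketch that enumerates a few sample packets instead of intersecting MDDs. The rule encoding, the sample packets, and the function names are assumptions made for illustration; the rules themselves follow Figure 19 and the surrounding text.

```python
import ipaddress

# FORWARD chain rules as (index, target, source, destination, dport);
# a dport of None means any port. Rule 1 contains the /22 typo.
RULES = [
    (1, "DROP", "192.168.2.0/22", "0.0.0.0/0", None),
    (2, "ACCEPT", "0.0.0.0/0", "192.168.3.0/24", None),
    (3, "ACCEPT", "0.0.0.0/0", "0.0.0.0/0", 25),
]
DEFAULT_POLICY = "DROP"  # reported as rule index 0


def in_net(address, network):
    # strict=False lets the mistyped 192.168.2.0/22 parse as 192.168.0.0/22
    return ipaddress.ip_address(address) in ipaddress.ip_network(network, strict=False)


def rule_matches(rule, packet):
    _, _, src_net, dst_net, dport = rule
    src, dst, port = packet
    return in_net(src, src_net) and in_net(dst, dst_net) and dport in (None, port)


def verdict(packet):
    """The first matching rule decides; otherwise the default policy applies."""
    for rule in RULES:
        if rule_matches(rule, packet):
            return rule[1]
    return DEFAULT_POLICY


def left_side(packet):
    """NOT FROM 192.168.2.* AND TO 192.168.4.* AND FOR TCP 25."""
    src, dst, port = packet
    return (not in_net(src, "192.168.2.0/24")
            and in_net(dst, "192.168.4.0/24") and port == 25)


def critical_rules(packets):
    """Union of the rule histories of every sampled packet in the fail set."""
    rules = set()
    for packet in packets:
        if left_side(packet) and verdict(packet) != "ACCEPT":
            rules |= {0} | {r[0] for r in RULES if rule_matches(r, packet)}
    return rules


SAMPLES = [(src, "192.168.4.2", 25)
           for src in ("192.168.0.1", "192.168.1.1", "192.168.2.1",
                       "192.168.3.1", "10.0.0.1")]
```

For these samples, `critical_rules(SAMPLES)` yields indices 0, 1, and 3, matching the rules listed in Figure 19; rule 2 never appears because no packet addressed to 192.168.4.0/24 matches it.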


History and Equivalence Classes 


It is often much easier to use assertions than to 
perform a manual inspection of the policy. For one 
thing, the rules in a policy interact with each other in 
ways that can be confusing to the user. One rule in the 
policy might mask another rule or cause the rule to be 
applied only in certain unusual circumstances. Because 
each assertion is independent of the others, writing 
and understanding a list of assertions is often easier 
than manually correcting the rule set. More importantly, 
it is possible to construct a partial or high-level 
specification of the policy using assertions. This partial 
specification can ignore many of the details of the 
policy, which allows it to be simpler than the rule set 
to which it is applied. 


Nevertheless, debugging the firewall using asser- 
tions has certain limitations. There is a tradeoff between 
the completeness of a specification and how easily the 
specification can be constructed. Deriving assertions 
that are both useful and effective can be a very chal- 
lenging task. 

Another limitation of the assertion approach is 
that certain kinds of faults cannot easily be identified 
using history maps for an assertion. 


The policy in Figure 20 is supposed to protect a 
secure subnet 192.168.2.0/24 from intrusions on an 
untrusted network 192.168.1.0/24. An assertion checking 
that legitimate SSH traffic can reach the protected 
network is also given in the figure. A typo in rule 2 
causes the assertion to fail. Unfortunately, the history 
map for the assertion will show only the default policy. 
None of the other rules in the policy match any packets 
in the fail set. In particular, rule 2, which contains the 
fault, does not match any packets from the 
192.168.2.0/24 subnet and, therefore, is not listed. 


One way to address this problem is to create an 
assertion that checks whether packets from 
192.186.2.0/24 are accepted. The history map for such an 
assertion would immediately identify the typo in rule 
2. The problem with this is that the system administra- 
tor has no way of knowing such an assertion is 
needed. It is not practical to create assertions for all of 
the possible typos in a policy, since doing so would 
require at least as much work as manual inspection of 
the policy. 

A better way to address the problem is to extend 
the technique described in [9] to provide history infor- 
mation that can be used to discover faults in the pol- 
icy. In that work, we described a technique for separat- 
ing the computers on a network into related classes of 
hosts based on information taken directly from the fire- 
wall policy. Hosts in each class are equivalent in the 
sense that the firewall will accept or drop a packet from 
(or to) a host in the class only if it will accept or drop 
an otherwise identical packet from (or to) any other 
host in the class. For instance, if the firewall drops all 
SSH packets from host A, it will also drop all SSH 
packets from host B if they are in the same class. Each 
class of hosts on the network is represented as a five-level 
“class MDD” which can be manipulated efficiently 
using MDD intersection and union operators. 
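
The equivalence-class idea can be sketched by grouping hosts whose packets the firewall treats identically. The sketch below is a brute-force approximation over a handful of sample hosts, not the MDD-based algorithm of [9]; the two rules (including the form of the mistyped rule), the host list, and the function names are hypothetical.

```python
import ipaddress
from collections import defaultdict

# Hypothetical two-rule chain: (target, source, destination, dport).
# The second rule carries a transposed-digit typo (192.186 for 192.168).
RULES = [
    ("DROP", "192.168.1.0/24", "0.0.0.0/0", None),
    ("ACCEPT", "0.0.0.0/0", "192.186.2.0/24", 22),
]
DEFAULT = "DROP"
PORTS = [22, 80]
HOSTS = ["192.168.1.5", "192.168.1.9", "192.168.2.5",
         "192.186.2.5", "10.0.0.1", "10.0.0.2"]


def verdict(src, dst, port):
    """The first matching rule decides; otherwise the default applies."""
    for target, src_net, dst_net, dport in RULES:
        if (ipaddress.ip_address(src) in ipaddress.ip_network(src_net)
                and ipaddress.ip_address(dst) in ipaddress.ip_network(dst_net)
                and dport in (None, port)):
            return target
    return DEFAULT


def host_classes(hosts):
    """Group hosts that the firewall cannot tell apart, as source or as
    destination, for every sampled peer and port."""
    groups = defaultdict(list)
    for host in hosts:
        signature = tuple(
            (verdict(host, peer, port), verdict(peer, host, port))
            for peer in hosts for port in PORTS
        )
        groups[signature].append(host)
    return sorted(groups.values())
```

On this sample, `host_classes(HOSTS)` produces three groups: the 192.168.1.x hosts, an unexpected singleton for 192.186.2.5, and everything else. The singleton is the kind of anomalous class that points to a typo.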


21st Large Installation System Administration Conference (LISA ’07) 35 

Assisted Firewall Policy Repair Using Examples and History 

Figure 21 lists the three host classes for the policy shown in 
Figure 20. Class 2 corresponds to the untrusted subnet 
192.168.1.0/24. Class 3 is an anomalous class of hosts 
caused by the typo in rule 2. The existence of this 
class is an immediate clue to the system administrator 
that the firewall policy contains a serious error. Class 
1 corresponds to all other hosts on the network. 


QUERY HISTORY CLASSES; 

There are 3 total host classes: 
Class 1: 
<Everything not in the other classes> 
Class 2: 
192.168.1.[0-255] 
Class 3: 
192.186.2.[0-255] 


Figure 21: Equivalence class decomposition of a pol- 
icy. 


Partitioning the hosts on a network into equiva- 
lence classes allows us to generate a “policy map” 
that shows functional groupings of the hosts on a net- 
work. The only input necessary to create a policy map 
is the firewall policy itself. When the policy contains a 
fault, it will often be manifested in the policy map as a 
missing class or by the presence of an unexpected 
class of hosts. The equivalence class technique can 
detect many kinds of faults that are difficult to identify 
using assertions. These faults include typos, overly 
broad rules, shadowed rules, outdated rules, and even 
missing rules. Unfortunately, while the policy map 
assists the system administrator in detecting these 
problems, it provides him with little information that 
can be used to identify the rules that must be changed 
to repair the issue. 


We can enhance the policy map by annotating 
each class of hosts with a list of rules that match pack- 
ets to and from a host in the class. To do this, we 
extend each class MDD with wildcard nodes. The 
resulting graph is similar in structure to the condition 
MDDs used to analyze assertions, but has wildcards at 
every level except the source address levels. This 
MDD matches the set of all packets whose source 
address matches a host in the class. We then repeat the 
procedure to produce an MDD with wildcards every- 
where except the destination address levels. We can 
now intersect with the history MDDs for each chain to 
determine which rules match these packets. This inter- 
section generates a result MDD which can be trans- 
lated into a human-readable history map. 
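
The source-wildcard half of this annotation can be approximated without MDDs. The sketch below is a simplification under stated assumptions: it lists the rules whose source field overlaps a class, treating every other field as a wildcard and ignoring rule ordering, shadowing, and the default policy. The chain contents and the function name `rules_matching_source` are hypothetical.

```python
import ipaddress

# Hypothetical chain: (rule number, action, source network, destination network).
CHAIN = [
    (1, "DROP", "192.168.1.0/24", "0.0.0.0/0"),
    (2, "ACCEPT", "0.0.0.0/0", "192.186.2.0/24"),
]


def rules_matching_source(class_net, chain):
    """Return the numbers of rules that some packet sourced from class_net can match."""
    cls = ipaddress.ip_network(class_net)
    return [
        number
        for number, _action, src, _dst in chain
        if cls.overlaps(ipaddress.ip_network(src))
    ]


class3_rules = rules_matching_source("192.186.2.0/24", CHAIN)
class2_rules = rules_matching_source("192.168.1.0/24", CHAIN)
```

The destination-wildcard case would be handled the same way by testing the destination field instead of the source field.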






ASSERT HISTORY TO 192.168.2.* AND
    FOR TCP 22 AND
    NOT FROM 192.168.1.*
    SUBSET OF ACCEPTED FORWARD;

[Figure 20, policy table fragment: DROP | source 192.168.1.0/24 | destination anywhere; remaining address entry: 192.186.2.0/24]
An MDD representing all packets with source 
address from class 3 is given in Figure 22. The top 
four levels of the MDD correspond to source ad- 
dresses on subnet 192.186.2.0/24. The remaining lev- 
els contain wildcard nodes. 


Figure 22: History MDD for class three. (Diagram; MDD levels from top to bottom: Source Address, Destination Address, Protocol, Destination Port, Chain, Rule.)





Class 3:
Firewall 0 Chain 1 Default Policy.
Firewall 0 Chain 1 Rule 2:
    ACCEPT all -- * * 0.0.0.0/0 192.186.2.0/24 tcp dpt:22

Figure 23: History Map for class three.


A portion of the history map for the equivalence
classes of the policy in Figure 20 is given in Figure
23. The existence of an anomalous class containing
hosts from the 192.186.2.0/24 subnet immediately
alerts the system administrator to a serious error. A
quick glance at the history map for class 3 reveals that
only two rules are of interest: the default policy and
rule 2. The system administrator now takes a careful
look at rule 2 and discovers the fault, which enables
her to repair the policy.


Existing Work 


There are many good techniques for finding
errors in a firewall policy. Tools such as nmap [3],
Nessus [5], and ftester [1] allow the system
administrator to actively test a firewall for specific
well-known vulnerabilities. Unfortunately, these tools are
little help in identifying the faults which cause an
error. For instance, nmap may indicate that a critical
network port is unavailable, but this can happen for a
variety of reasons: the host is down, the firewall
prohibits access to that port, the TCP wrappers on the
server are incorrectly configured, or a routing error
makes the host unreachable. Distinguishing between
these potential causes is extremely difficult. Even once
the error has been narrowed down to the firewall, these
tools do not provide any information about the policy
itself that aids the user in determining why the error
has occurred.

Figure 20: A fault that history mapping misses.


More rigorous testing can be done using passive 
testing tools, such as the Lumeta firewall analysis 
engine [10, 12], that perform an exhaustive off-line 
analysis of the policy. These tools simplify the task of 
determining whether the firewall policy contains errors 
by allowing the user to specify a set of logical queries 
that provide information about the policy. Some work 
has also been done on using expert systems to test the 
firewall policy [2]. The Lumeta engine provides sup- 
port for History Mapping and limited example genera- 
tion. However, the Lumeta engine is a proprietary 
closed-source product, which does not support ipta- 
bles. These passive analysis tools also do not provide 
class-based analysis and therefore require the user to 
invest a significant amount of time designing appro- 
priate queries or specifications against which the pol- 
icy must be tested. 


Conclusion 


Using examples and history mapping, a system 
administrator can easily identify the two or three critical 
rules in a rule set that lead to a serious firewall error. 
Detecting these faults greatly reduces the amount of 
time an administrator must spend in careful examina- 
tion of the policy and makes it much easier to manage 
and maintain a large, restrictive firewall policy. Using 
counterexamples and witnesses, the system administra- 
tor also gains valuable knowledge about the circum- 
stances under which an error occurs. Using rule history 
with equivalence classes allows the system admini- 
strator to quickly and easily detect both errors and 
faults in the policy without constructing a large num- 
ber of complicated assertions. The only required input 
is the policy itself. This greatly simplifies the process 
of maintaining, debugging, and repairing a restrictive 
firewall policy on a large network. 


The techniques for generating a history map and
relevant counterexamples have been implemented in
our tool, ITVal, which can be downloaded from
http://itval.sourceforge.net. The web site also provides
several example specification files which can be
downloaded and customized for use on a variety of networks.


ABSTRACT 


Computer and network administrators are often confused or uncertain about the behavior of
their networks. Traditional analysis using IP ports, addresses, and protocols is insufficient to
understand modern computer networks. Here we describe NetADHICT, a tool for better
understanding the behavior of network traffic. The key innovation of NetADHICT is that it can identify
and present a hierarchical decomposition of traffic that is based upon the learned structure of both
packet headers and payloads. In particular, it decomposes traffic without the use of protocol
dissectors or other application-specific knowledge. Through an AJAX-based web interface, NetADHICT
allows administrators to see the high-level structure of network traffic, monitor how traffic within
that structure changes over time, and analyze the significance of those changes. NetADHICT
allows administrators to observe global patterns of behavior and then focus on the specific packets
associated with that behavior, acting as a bridge from higher level tools to the lower level ones.
From experiments we believe that NetADHICT can assist in the identification of flash crowds,
rapidly propagating worms, and P2P applications.


Introduction 


Network administrators are regularly confounded 
by the behavior of the networks they manage. Part of 
this confusion is a function of the rapid innovation in 
applications and protocols; it also arises from simple 
human unpredictability. Much of the blame for the 
mystery of computer networks, though, can be laid at 
the feet of our tools: we simply do not have the means 
for truly understanding what is happening in our net- 
works [11]. 


To be sure, we have numerous tools for monitor- 
ing networks. Packet volume monitors can alert admin- 
istrators to gross changes in network behavior. Flow re- 
construction and protocol dissectors can reveal the be- 
havior of individual connections. Signature scanners 
can identify specific security problems, and anomaly 
detectors can tell us that something is “different.” As 
we discuss in the Related Work section, these tools, 
while useful, are not sufficient for us to achieve net- 
work awareness. 


One key piece that is missing is a way to view
traffic at the “right” level of abstraction. For example,
when dealing with a surge in web traffic, we need more
detail than “80% of your traffic is HTTP”; however,
analyzing the patterns in 50,000 HTTP connections is
sure to induce information overload. In order to address
this shortcoming, we need tools that can automatically
extract multiple human-comprehensible abstractions of
network behavior (e.g., most connections have
slashdot in the HTTP referrer field). By understanding the
structure of the observed abstractions and by seeing
how future traffic fits into them, an administrator can
quickly come to understand how network behavior
changes at a level of granularity that facilitates both
holistic understanding and appropriate response.


We have developed a tool called NetADHICT
(pronounced “net-addict”) for extracting and
visualizing context-dependent abstractions of network
behavior. NetADHICT hierarchically decomposes network
traffic: for example, observed packets are first divided 
into IP and non-IP groups, IP packets are then split into 
TCP and non-TCP, and so on. What is notable about 
NetADHICT, though, is that its decomposition is auto- 
matically derived from observed traffic; in other words, 
it learns an appropriate context-dependent hierarchical 
classification scheme automatically with no built-in 
knowledge of packet or protocol structure. 


NetADHICT is able to perform this feat through
the use of a novel clustering method we refer to as
“Approximate Divisive Hierarchical Clustering (ADHIC).”
When applied to traces of captured traffic, NetADHICT 
generates hierarchical clusters that closely correspond to 
the kind of semantically meaningful abstractions that 
are normally used to describe network behavior [10]. 
Remarkably, NetADHICT often aggregates packets 
without the use of ports or IP addresses; it also groups 
packets across multiple flows when there is significant 
commonality. 


Although we have not attempted to optimize 
NetADHICT, it is designed to be extremely efficient, 
allowing use of the tool at wire speed. We have bench- 
marked NetADHICT at 245 Mb/s. NetADHICT can 
achieve this speed because it classifies packets using
sets of fixed-length strings located at fixed offsets
within packets. We refer to these patterns as (p,n)-grams.


Of course, fancy algorithms are not enough to 
make a useful tool for observing networks; we also 
need appropriate interfaces for visualizing and inter- 
acting with observed traffic. To that end, this paper de- 
scribes NetADHICT from the point of view of a net- 
work administrator. Specifically, we explain how Net- 
ADHICT’s interactive AJAX-based web interface can 
be used to understand network behavior in a new, and 
we believe highly useful, way — one that gives insights 
into the sub-protocol behavior of networks in a way 
that preserves user privacy. Even though NetADHICT 
is still evolving, it has already proved to be a valuable 
tool for understanding networks. NetADHICT is made 
available under the GNU GPL license [13]; we hope 
that this paper will encourage others to get involved in 
the evolution of NetADHICT. 


The rest of this paper proceeds as follows. The 
next section explains related work in tools and meth- 
ods for understanding network traffic. We then give a 
detailed rationale for the use of hierarchical clustering 
to analyze networks and subsequently describe the 
ADHIC clustering algorithm. The interface and imple- 
mentation of NetADHICT are described in the next 
section which is followed by a narrative on how one 
uses the tool in conjunction with other tools to allow 
administrators to better understand their networks. We 
then discuss limitations of NetADHICT and plans for 
future work before concluding by describing how to 
obtain our tool and a summary of our work. 


Related Work 


Fundamentally, the problem of understanding net- 
work behavior is one of data reduction: we must trans- 
form gigabytes of network traffic into a small number 
of concepts and details that can be understood and act- 
ed upon by human administrators. As there are many 
different ways in which we can choose to understand 
network behavior, there is correspondingly a variety of 
tools that are designed to answer different questions. 


Network and systems administrators have a vari- 
ety of mature tools to select from for counting packets 
and flows. From open source solutions such as MRTG 
[16] and FlowScan [19] to large commercial offerings 
such as HP Openview [17] and IBM Tivoli [12], tools 
are readily available for visualizing the packet and 
flow statistics records that can be exported by many 
routers. Such solutions allow an administrator to deter- 
mine bandwidth consumption on a host, network, or 
port basis. For well-behaved applications that use
standard ports, such views can be useful for identifying
deployed applications; however, because most
peer-to-peer and other multimedia protocols either avoid
standard ports or masquerade as HTTP (port 80) traffic,
more sophisticated tools are needed to identify the
most bandwidth-hungry applications.


The most common strategy for monitoring eva- 
sive applications at the network layer is to combine 
flow reconstruction with deep packet inspection and 
application-specific identification rules (signatures). 
Tools such as Wireshark (formerly Ethereal) [3] are 
designed to perform such analysis on captured packet 
traces. Commercial network forensics tools [4] facili- 
tate ongoing traffic capture so that recent behavior can 
be queried on demand to provide packet, flow, appli- 
cation, and even user-level views of traffic. In con- 
trast, other commercial products such as those pro- 
duced by Sandvine [22] can analyze packets at wire 
speeds, primarily to enable per-application traffic sha- 
ping. What these systems have in common is that they 
rely upon elaborate sets of rules in order to identify, 
dissect, and even change application behavior. While 
such rules are generally crafted by hand, there has 
been extensive research in identifying applications and 
usage patterns using a variety of machine learning 
techniques [2, 14, 6]. 

Network administrators, however, are interested in 
more than monitoring what applications are running; 
they also need to know when human intruders or mal- 
ware have compromised the security of their systems. 
Intrusion detection systems are most commonly built 
upon elaborate rule databases that specify the signatures 
associated with known attacks [21] or specifications of 
what network activity is and is not allowed [18]. One 
fundamental problem for all such systems, though, is 
that disruptive network activity often is caused not by 
attackers but by highly popular legitimate applications 
or services (e.g., flash crowds). Either the rules have to 
be broad enough to not signal alarms in some disruptive 
conditions, or administrators have to tolerate a regular 
stream of false alarms. 


Anomaly detection systems have the potential to 
adjust rules to local conditions through the use of ma- 
chine learning mechanisms; behavior at the network 
layer, however, is highly dynamic, and in fact attempt- 
ed attacks are now routine on the Internet. For these 
reasons, some researchers are coming to believe that 
anomaly detection at the network layer is a question- 
able strategy for detecting security incidents [9]; such 
reasoning, however, does not exclude the use of anom- 
aly detection for detecting networking problems in 
general [1]. 

While these various techniques for monitoring 
and classifying network behavior have their place, 
there is also a need for tools that can uncover other as- 
pects of the structure of network traffic. The next sec- 
tion explores what is missing in currently available 
tools. 


Understanding Network Traffic 


To understand network behavior we need to ob- 
serve more than changes in packet volume or security 
status. In addition, we need to discern patterns in traf- 
fic that allow us to group related communication ac- 
tivities together. While we cannot hope to find all pos- 
sible patterns, there are a number of ways to capture 
simple patterns in network traffic. The general strategy 
for grouping packets together using automatically 
learned patterns (or features) is to employ a clustering- 
based machine learning algorithm. 


One strategy for clustering packets is to learn
patterns associated with sets of flows. For example,
AutoFocus [7] is a system that hierarchically clusters
packets using IP address and port information so that
temporal patterns in the activities of groups of hosts can be
inferred. However, there are many interesting patterns
which cannot be defined in terms of protocol and
address. Ma et al. [14] addressed this limitation by
building classifiers from network flows using the header and
first 64 bytes of payload from the initial packet in each
flow. Their classifiers were built using clusterers, which
do not label data but simply group it together.
Once a clusterer was constructed, the packets in each
cluster were examined and assigned to a particular
protocol, turning the clusterer into a classifier. Although
Ma’s approach is promising, it is not fast enough to be
used to monitor networks online. In fact, most general
clustering and classification algorithms are simply too
slow to learn at wire speed. This limitation is significant
because we want to observe changes in behavior as they
happen so that we may react to them in a timely fashion.

NetADHICT differentiates itself from other traffic
clustering tools in that it clusters traffic
without knowing in advance any details about the
packets’ structure and does so in a way that can
execute at the speed of the network. This allows it to find
its own differences between the packets, differences
that adapt automatically to the traffic as it changes.
Traffic volume patterns can be watched online and
compared between the differentiated clusters. These
traffic groups are presented visually through a tree:
packets that are found to be internally dissimilar are
clustered separately, so that users immediately observe
that they are different in nature. Also, each cluster’s
position in the tree contains a wealth of context about
how the traffic for that cluster is similar to and different
from the rest of the network’s traffic. The volume of traffic
through each cluster is displayed in the tree in near
real-time, allowing administrators to analyze traffic
patterns as they emerge.


High level tools can often show that something is
wrong. Low level tools give tremendous detail, but
often are too slow or too information-rich to be used.
Clustering is a form of filtering that allows one to
focus on relevant behavior. NetADHICT allows the
integration of high and low level tools while providing
enough information to allow administrators to move
smoothly from high to low.


The NetADHICT interface is designed to facilitate
such multi-level exploration. It is built around the
cluster tree approach to network volume visualization.
Each cluster of traffic within the tree is identified by a
traditional classifier and colored appropriately. (The
traditional classifier is built around EtherType, IP
protocol, and IP port.) This coloring of the clusters, along
with their position in the tree, highlights traffic that is
contextually different but using the same ports. It
provides an excellent starting point for delving further
into different classes of traffic in the network and
highlights classes that would have remained completely
hidden otherwise. In this fashion NetADHICT is a tool
for enhancing network awareness.

Figure 1: An ADHIC cluster decision tree after 3 node splits. The labelling of the internal nodes and the pie charts
of the terminal nodes are explained in the “NetADHICT” section.


Hierarchical Clustering with (p,n)-grams 


NetADHICT is centered around a novel clustering
algorithm that recursively splits the set of observed
packets into smaller and smaller groups. We
call our algorithm Approximate Divisive HIerarchical
Clustering (ADHIC), the root of NETwork ADHIC
Tool (NetADHICT). This algorithm belongs to the
family of machine learning algorithms broadly known as
divisive hierarchical clustering [5], but ours is
substantially different from other general purpose or
network specific clustering algorithms. We give an
overview of ADHIC below; for more details, please see
Hijazi [10].

The feature we use to split groups of packets is
called a (p,n)-gram. A (p,n)-gram denotes a
substring at a fixed offset within a packet, where p is the
offset and n is the length of the substring. For example,
the (p,n)-gram (8, 0xdc3b) denotes the 2 byte
hexadecimal substring 0xdc3b located 8 bytes into the packet
(these are the middle bytes of the source MAC address
in the Ethernet frame).
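
A minimal sketch of a (p,n)-gram test follows, assuming a gram is stored as an offset plus the expected byte string (so n is the length of that string). The `PNGram` type, the `matches` function, and the sample frame are made up for illustration and are not taken from the NetADHICT source.

```python
from typing import NamedTuple


class PNGram(NamedTuple):
    offset: int   # p: byte offset from the start of the frame
    value: bytes  # the n-byte substring expected at that offset


def matches(gram: PNGram, packet: bytes) -> bool:
    """True if the packet contains gram.value exactly at gram.offset."""
    end = gram.offset + len(gram.value)
    return len(packet) >= end and packet[gram.offset:end] == gram.value


# The example gram from the text: 0xdc3b located 8 bytes into the packet.
gram = PNGram(offset=8, value=bytes.fromhex("dc3b"))

# Made-up Ethernet header: destination MAC, source MAC, EtherType.
frame = bytes.fromhex("ffffffffffff" "0011dc3b4455" "0800")
hit = matches(gram, frame)
```

Because the test is a fixed-offset byte comparison, its cost does not depend on any knowledge of the protocol carried in the packet.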


Divisive hierarchical clusterers form decision trees. 
The tree is constructed using the split and merge oper- 
ations. When the average bandwidth of a terminal 
cluster (leaf) exceeds a configurable threshold over a 
certain time window, the cluster is split into two clus- 
ters. The existing node becomes an internal decision 
node and two new terminal clusters are formed. A 
(p,n)-gram is found by examining the packet cache, a 
buffer that stores a few percent of recently received 
packets. (p,n)-gram frequencies are calculated for 
packets assigned to the terminal cluster in question, 
and a (p,n)-gram is chosen which matches roughly 
half of them. Conversely, if a node averages less than 
a set bandwidth limit, it is deleted and its parent be- 
comes a leaf node. Newly created nodes cannot be 
merged for a certain number of minutes, called the 
maturation period, to prevent transient behavior from 
affecting tree structure. 
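
The gram selection in the split step can be sketched as a frequency count over the cached packets of one cluster. The selection rule below (closest to half, ties broken by lowest offset and then lowest value) and the parameters `n` and `max_offset` are assumptions; the text only says that the chosen gram matches roughly half of the packets.

```python
from collections import Counter


def choose_split_gram(packets, n=2, max_offset=64):
    """Return the (offset, value) pair whose match count is closest to half the packets."""
    counts = Counter()
    for pkt in packets:
        # One candidate gram per offset, so each gram is counted once per packet.
        for p in range(min(max_offset, len(pkt) - n + 1)):
            counts[(p, pkt[p:p + n])] += 1
    if not counts:
        return None
    half = len(packets) / 2
    return min(counts, key=lambda g: (abs(counts[g] - half), g[0], g[1]))


cache = [b"\x00\x01AAAA", b"\x00\x01BBBB", b"\x00\x02AAAA", b"\x00\x02BBBB"]
split = choose_split_gram(cache)
```

In this toy cache several grams match exactly two of the four packets; the tie-break picks the one at the lowest offset.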


Through these two operators, ADHIC generates 
a decision tree that specifies the contents of clusters. 
The path from the root of the tree to a leaf or other 
internal cluster specifies the boolean equation of 
(p,n)-grams which determine if a packet belongs to 
that cluster. 
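
To make the path-as-boolean-equation idea concrete, the sketch below walks a packet down a small decision tree and records each test along the way. The `Node` layout is hypothetical. The grams are those quoted for Figure 1, but which child hangs off which edge is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    offset: int = 0
    value: bytes = b""
    match: Optional["Node"] = None     # edge taken when the packet contains the gram
    no_match: Optional["Node"] = None  # edge taken when it does not

    def is_leaf(self) -> bool:
        return self.match is None and self.no_match is None


def classify(root: Node, packet: bytes):
    """Walk from the root to a terminal cluster, recording every decision."""
    node, path = root, []
    while not node.is_leaf():
        hit = packet[node.offset:node.offset + len(node.value)] == node.value
        path.append((node.offset, node.value.hex(), hit))
        node = node.match if hit else node.no_match
    return node.name, path


tree = Node(
    "root", 1, bytes.fromhex("0393"),
    match=Node("n1", 9, bytes.fromhex("70ad"), Node("A"), Node("B")),
    no_match=Node("n2", 0, bytes.fromhex("0100"), Node("C"), Node("D")),
)
packet = bytes.fromhex("010000000000" "aabbcc70ad11" "0800")
cluster, path = classify(tree, packet)
```

The recorded path reads as a conjunction: here the packet lacks (1, 0x0393) and contains (0, 0x0100), which places it in cluster C.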


Figure 2: NetADHICT’s primary tree view, showing traffic over a four hour period. (Screenshot of the NetAdhict v1.0 web interface running in Gran Paradiso: a Time Range panel, a Node Details panel for Node 0 (16,163,072 packets) listing per-label packet counts for labels such as ether_ipv4_tcp, ether_ipv4_tcp_http, and ether_ipv4_udp_ntp, and the cluster tree, generated on Aug 12, 2007 at 01:25 PM.)
Consider, for example, the tree in Figure 1. The
root node has a (p,n)-gram of (1, 0x0393) and its children
have the (p,n)-grams (9, 0x70ad) and (0, 0x0100). All the offsets
point to locations within the MAC addresses of the
Ethernet frame. The offsets 0 and 1 select portions of
the destination MAC address and 9 distinguishes
between source MAC addresses. The left edge signifies
that the packet has matched the (p,n)-gram in the parent
node. The right edge is followed for packets that do not
contain that (p,n)-gram. Please note that the labeling of the
nodes and the pie charts is not a part of the ADHIC
algorithm; it is a visualization produced by NetADHICT
and is explained in the “NetADHICT” section.


The decision trees produced by the ADHIC algo- 
rithm have a number of strengths as representations of 
network structure. They are simple in structure and se- 
mantics, facilitating user understanding and analysis. 
Trees can be frozen, or subtrees removed from the 
learning algorithm, allowing users to directly modify 
the tree. Subtrees can be incrementally modified and 
augmented by users in a straightforward way. Addi- 
tional information and statistics may be easily added 
to decision trees. Finally, the ADHIC representation 
also easily lends itself to implementations like the de- 
cision tree packet classification algorithms often used 
as alternatives to TCAM [23]. 


NetADHICT 


NetADHICT’s user interface is an interactive 
AJAX-based web page (Figure 2). The primary ele- 
ment of this interface is the tree view, which allows 
the network administrator to quickly see how traffic is 
being distributed between clusters and what traffic 
each cluster represents. The tree view can update itself
as new data becomes available, allowing network
administrators to watch changes in the traffic’s
structure as they occur.


Traffic is shown in the tree view for a selected 
time period, as shown in Figure 3. The time period is 
normally selected relative to the present time; it can be 
set to always show the latest data available or just data 
that was available in the past. Alternatively, a time pe- 
riod can be specified directly for any time at which 
there is traffic data available. 


In the tree view, each internal node represents a
(p,n)-gram test, which is displayed within the node,
such as (6, 0x0001). By tracing a node’s ancestors up
the tree, you can see which (p,n)-grams are or are not
present in the node’s traffic.


Terminal clusters show traffic volumes for the se- 
lected time period, as well as much more information 
through their size and coloring. The size of each termi- 
nal cluster represents the volume of traffic relative to all 
other terminal clusters in the tree. The sizes make it 
easy to see where traffic is distributed within the tree 
from a high level, or within a subtree at a lower level. 


The coloring of each terminal cluster represents
the basic packet types of its traffic and the traffic’s
labels. Basic packet types include IP, TCP, UDP, and
other non-IP traffic. For each cluster, the outer ring’s
color shows what percentage of traffic for that cluster,
for the specified time period, consisted of IP packets.
For IP traffic, an inner ring shows what percentage of
the traffic was TCP, UDP, or other IP packets. In Figure
4, TCP traffic dominated but approximately 6% of the
traffic assigned to the cluster was UDP packets and
another 5% was non-IP packets (ARP in this case).

Figure 3: Selecting a time range for which NetADHICT
will display a tree and traffic statistics.

Figure 4: A single cluster in NetADHICT, containing
a large number of traffic types.


Labels are used to name and group the semantic 
classes of traffic for a cluster. NetADHICT provides 
initial labels and colors with a simple traffic classifica- 
tion by EtherType, IP protocol, and IP port, but a user- 
defined label can be applied to each classification 
when a more precise semantic class is determined by 
the network administrator. The labels for each terminal
cluster are represented through the pie charts within
the rings. Each color represents a different label, and
the size of each slice represents the percentage of traffic
for that cluster which belonged to the given label.
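
A simple classifier of the kind described (EtherType, then IP protocol, then port) might look like the sketch below. The label naming scheme (for example `ether_ipv4_tcp_http`), the small lookup tables, and the choice to test the destination port before the source port are all assumptions for illustration, not NetADHICT's actual classifier.

```python
import struct

# Tiny illustrative lookup tables; a real classifier would cover far more values.
ETHERTYPES = {0x0800: "ipv4", 0x0806: "arp"}
IP_PROTOCOLS = {6: "tcp", 17: "udp"}
PORTS = {80: "http", 123: "ntp"}


def initial_label(frame: bytes) -> str:
    """Build a label from the EtherType, IP protocol, and a well-known port."""
    parts = ["ether"]
    if len(frame) < 14:
        return "_".join(parts)
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype not in ETHERTYPES:
        return "_".join(parts)
    parts.append(ETHERTYPES[ethertype])
    if ethertype != 0x0800 or len(frame) < 34:
        return "_".join(parts)
    header_len = (frame[14] & 0x0F) * 4  # IPv4 IHL field, in bytes
    protocol = frame[23]                 # byte 9 of the IPv4 header
    if protocol not in IP_PROTOCOLS:
        return "_".join(parts)
    parts.append(IP_PROTOCOLS[protocol])
    l4 = 14 + header_len
    if len(frame) >= l4 + 4:
        sport, dport = struct.unpack("!HH", frame[l4:l4 + 4])
        for port in (dport, sport):
            if port in PORTS:
                parts.append(PORTS[port])
                break
    return "_".join(parts)


def make_frame(protocol: int, sport: int, dport: int) -> bytes:
    """Assemble a minimal Ethernet + IPv4 + port header for testing."""
    eth = bytes(12) + b"\x08\x00"
    ip = bytes([0x45]) + bytes(8) + bytes([protocol]) + bytes(10)
    return eth + ip + struct.pack("!HH", sport, dport)


web_label = initial_label(make_frame(6, 50000, 80))
```

A user-defined label would then override such an initial label once the administrator has identified the traffic more precisely.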


Labels are created, modified, and deleted by the 
network administrator within the NetADHICT inter- 
face. By clicking on the “Edit labels” link in the top 
right corner, the label editing interface (Figure 5) is 
displayed. From here the system administrator can
create new labels, remove old ones, or change the
names and colors of existing labels.

Figure 5: Editing the list of user-defined labels.

Figure 6: Details of a node in the tree.


Detailed information for any node in the tree,
internal nodes and terminal clusters alike, can be
displayed by hovering the mouse over the node (Figure
6). The details include the precise volumes of traffic
which the node and all nodes below it represent. They
also show what traffic classification or label each
color in the node’s pie chart represents, what percentage
of the node’s traffic fell into each label, and let you
change the labels for the node’s traffic.


For more detailed analysis, the traffic represented
by any cluster can be exported by NetADHICT into
packet dump files. These dumps can then be analyzed
using standard network analysis tools such as
Wireshark. Any number of clusters can be selected for a
packet capture by shift-clicking on them. While a cluster is
selected for capturing, all packets that match it
will be captured to pcap files; these files can be
downloaded at any time during the capture.
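
For readers who want to script their own analysis of such exports, the sketch below reads the classic pcap file layout using only the Python standard library. It assumes the exported files use the classic microsecond-resolution pcap format (a 24-byte global header followed by 16-byte record headers); the function name `read_pcap` and the sample data are illustrative.

```python
import struct


def read_pcap(data: bytes):
    """Yield (timestamp_seconds, frame_bytes) from classic pcap file contents."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    pos = 24  # skip the global header
    while pos + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[pos:pos + 16]
        )
        pos += 16
        yield ts_sec + ts_usec / 1e6, data[pos:pos + incl_len]
        pos += incl_len


# Build a one-packet capture in memory to exercise the reader.
global_header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
record = struct.pack("<IIII", 10, 500000, 3, 3) + b"abc"
packets = list(read_pcap(global_header + record))
```

For a real export, the same function could be applied to the bytes of a downloaded capture file.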


The NetADHICT backend packet analyzer and 
frontend web interface both require access to a MySQL 
database. The frontend is a Ruby on Rails application 
and so requires a web server configured with support 
for such applications. A web browser is used to view 
the NetADHICT interface. 


NetADHICT’s trees are rendered using standard 
SVG and JavaScript; however, because many current 
browsers have poor support for JavaScript-generated 
SVG, the quality of NetADHICT’s interface varies 
greatly depending upon the web browser used. In fact, 
at the time of this writing the Firefox 3 alpha release 
(also known as Gran Paradiso [8]) is the only truly ca- 
pable browser for running NetADHICT. Firefox 2 is 
functional but too slow for real use. Internet Explorer 
7 and Konqueror 3 lack the SVG support required, and 
the latest Opera release (version 9.23) contains critical 
bugs in regard to its SVG support that prevent Net- 
ADHICT from functioning. While this variable perfor- 
mance is currently problematic, better SVG support is 
on the short-term development roadmap for most brow- 
sers; thus, we believe the browser compatibility issue 
will soon cease to be a significant problem for Net- 
ADHICT. 


Usage Scenarios 


We now move from describing the tool itself to 
how a network administrator would use the tool in 
several common situations. These situations are: 

1) checking normal network traffic, 

2) analyzing a flash crowd, 

3) recognizing special network usages or activities 
such as local P2P traffic, and 

4) identifying and isolating a propagating worm. 

In earlier papers [10, 15], we described several 
experiments which we used to evaluate the ADHIC al- 
gorithm. The following descriptions are not of actual 
experiments, but we draw upon our research experi- 
ence to describe how one could use the tool in a small 
network context. While we believe these usage scenar- 
ios should generalize to larger networks, further test- 
ing is required. 

Checking Normal Network Traffic 


The first usage scenario is network surveillance 
under normal conditions. The network administrator 
begins a session by examining the current tree, as in 
Figure 2. An administrator knows what a normal-con- 
dition tree and flow values (the size of the pie charts) 
look like and can immediately spot anomalous behav- 
ior. If he sees unusual behavior in some part of the tree 
he can magnify the subtree and examine its volume 
pie charts to see what is wrong. 
Alternatively, he can view the tree for specific
past time periods. By viewing the series of trees from 
the immediate past he can see how the tree developed 
to its current form. We have found this “movie” to be
very useful in understanding network behavior. It is 
analogous to animated radar maps — but instead it 
shows network weather. 


As we will discuss in the other usage scenarios, 
anomalous traffic flows will lead to easily identifiable 
changes in the tree structure or the volume pie charts. 
However, if the tree visualization is not enough to 
identify the problem he can hover the cursor over a 
node to see the statistical network traffic volume sum- 
mary as in Figure 6. 


An administrator has another option if he needs 
more information than the two visualizations provide. 
He can shift-click on any number of clusters to request 
that NetADHICT record all future packets assigned to 
those clusters. These packets can then be downloaded 
as a pcap file into a tool like WireShark [3]. This al- 
lows the administrator to use any analysis tool he likes 
that reads pcap files while first using NetADHICT to 
eliminate the vast majority of uninteresting traffic. 
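
As an illustration of working with a downloaded capture outside a full analyzer, the following minimal sketch walks the records of a classic pcap file using only the Python standard library. It is not part of NetADHICT: the function names are hypothetical, and it assumes the little-endian, microsecond-resolution pcap format (magic number 0xa1b2c3d4) rather than pcapng or byte-swapped captures.

```python
import struct

# Classic pcap layout: a 24-byte global header, then one 16-byte
# record header in front of every captured packet.
PCAP_GLOBAL_HEADER = struct.Struct("<IHHiIII")
PCAP_RECORD_HEADER = struct.Struct("<IIII")
PCAP_MAGIC = 0xA1B2C3D4


def iter_pcap_packets(data):
    """Yield (timestamp_seconds, packet_bytes) for each record in a pcap buffer."""
    magic = struct.unpack_from("<I", data, 0)[0]
    if magic != PCAP_MAGIC:
        raise ValueError("not a little-endian classic pcap buffer")
    offset = PCAP_GLOBAL_HEADER.size
    while offset + PCAP_RECORD_HEADER.size <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = PCAP_RECORD_HEADER.unpack_from(data, offset)
        offset += PCAP_RECORD_HEADER.size
        yield ts_sec + ts_usec / 1e6, data[offset:offset + incl_len]
        offset += incl_len


def summarize(data):
    """Return (packet_count, captured_byte_count) for a pcap buffer."""
    packets = [pkt for _, pkt in iter_pcap_packets(data)]
    return len(packets), sum(len(pkt) for pkt in packets)
```

A capture saved from the interface could then be summarized by reading the file in binary mode and passing its contents to `summarize`; heavier analysis is still better left to a dedicated pcap tool.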


Analyzing a Flash Crowd 


Now let us consider what might happen if a particular
page on a website was hit by a flash crowd, or
“slashdotted.” A traditional network tool would notify
the administrator that total traffic volume is up and
when the surge occurred, and would probably identify,
using port numbers, that HTTP is the culprit. The
administrator might then turn to the web server logs for
further information.


Like a traditional tool, NetADHICT would show 
that traffic is up, that it is HTTP, and when the surge in 
traffic started and if and when it ended. NetADHICT’s 
other behavior would differ based on how long the flash 
crowd remains. If it is much shorter than the matura- 
tion period, then the flash crowd would be contained 
in the terminal nodes. These would likely be dominat- 
ed by HTTP and the administrator would have to refer 
to the web server’s logs for more information. If the 
flash crowd remained longer than the maturation peri- 
od, though, NetADHICT would begin segregating 
traffic, searching for commonalities in the packets of 
the flash crowd. 


If the flash crowd were from a particular network
with a common prefix, NetADHICT might recognize
it by specifying the address in the created (p,n)-grams.
Or, it might find (p,n)-grams that matched other parts
of the header or even payload contents such as the
requested URL. The key thing to remember is that
NetADHICT will keep refining its classification of
incoming packets so long as certain nodes continue to
receive more than their “fair share” of packets, and so
long as there exist (p,n)-grams that match a significant
fraction of a high traffic node’s packets.


The resulting cluster equations (encoded in the 
structure of the tree) would give the administrator 
more information than the traditional tools before he
dove into the web logs. Further, because NetADHICT 
can provide exemplars of the “new” packets, a net- 
work administrator would have a significant head start 
in determining the precise characteristics and origin of 
the flash crowd. 


Identifying P2P Traffic 


Peer-to-peer (P2P) traffic is in some ways similar
to flash crowds. The difference is that many P2P
protocols are not well documented and many actively try
to evade traditional traffic classification. For example,
many BitTorrent clients now provide the option to
encrypt their data so that payload inspection cannot identify
the protocol [20]. In addition to encrypting its payload,
BitTorrent does not use standard ports. Because of
this, traditional tools often either cannot classify or
misclassify P2P traffic that uses these evasion tactics.


NetADHICT does not need to know about proto- 
col structure, so it can be more useful than traditional 
tools in such situations. In our evaluation experiments 
[10] we found that ADHIC, when confronted with 
(unencrypted) BitTorrent as a new protocol, segregat- 
ed the traffic into just two terminal clusters. These 
clusters corresponded to the UDP tracker packets and 
the TCP content packets. Thus, we believe that a net- 
work administrator, when investigating a new high 
network volume application, would see one or a few 
nodes with increased traffic. If the P2P session lasted 
several maturation periods, NetADHICT would begin 
splitting those nodes, building a P2P dominated sub- 
tree. Other than the port-based color labeling, Net- 
ADHICT cannot classify traffic, but it could provide a 
pcap file of P2P traffic with all other traffic filtered out 
for the administrator to investigate. 


NetADHICT application-level classifications are 
more robust than those produced by traditional port- 
classifying tools. For example, suppose that a P2P 
client uses port 80 for its traffic. Traditional tools would 
likely mis-classify the new application traffic as HTTP. 
In experiments we have found that NetADHICT, be- 
cause it often ignores ports, continues to cluster P2P 
traffic correctly even when clients use misleading ports 
[10]. 

Identifying Worm Traffic 


Finally, let us consider the case of a propagating
worm. Worms are a subcase of the flash crowd
because they involve a large increase in traffic over
protocols that are already used. The benefits of
NetADHICT are similar to those during a “slashdotting.”
If a particular set of addresses is responsible,
NetADHICT may allow the administrator to discover it
by knowing the (p,n)-gram offsets. Similarly,
NetADHICT might also create new clusters specific to
worm traffic. Inspection of the (p,n)-gram equation
might allow the administrator to construct a signature
for the worm, which could then be applied to a
firewall or an intrusion prevention system. Even if the
(p,n)-gram equations could not be used for this
purpose, the pcap file NetADHICT provides would be
useful in signature construction.


We believe NetADHICT would be useful in many
additional scenarios. We hope, however, that these
provide a useful introduction to how the tool could be used.
NetADHICT, as a complement to existing tools and 
systems, can allow an administrator to quickly ascer- 
tain the state of his network and investigate any 
anomalous behavior as it occurs. 


Discussion 


While our experience with NetADHICT has con- 
vinced us that it is a useful tool for network adminis- 
trators, we have also experienced its limitations. First 
and foremost is ADHIC’s inability to classify packets. 
ADHIC is a clustering algorithm that does not rely on 
prior knowledge to segregate traffic into clusters — 
therefore the clusters may not always have the seman- 
tic splits that administrators may prefer. ADHIC’s use 
of (p,n)-grams also is a source of problems. Some in- 
teresting semantic splits do not lend themselves to us- 
ing (p,n)-grams. The reason for this is that some iden- 
tifiers are not at constant offsets from the beginning of 
the packet. These weaknesses are mitigated by Net- 
ADHICT’s incorporation of a traditional classifier, its 
ability to work with other network analysis tools using 
pcap files, and the administrator’s ability to label the 
cluster tree. 
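
The offset limitation can be made concrete with a small sketch. It assumes, as the discussion of constant offsets suggests, that a (p,n)-gram is an n-byte string expected at byte offset p of a packet; the helper names are hypothetical and this is not ADHIC's implementation.

```python
def matches_gram(packet, p, gram):
    """True if the n bytes of `gram` appear in `packet` exactly at offset p."""
    return packet[p:p + len(gram)] == gram


def match_fraction(packets, p, gram):
    """Fraction of packets that carry `gram` at offset p."""
    if not packets:
        return 0.0
    hits = sum(1 for pkt in packets if matches_gram(pkt, p, gram))
    return hits / len(packets)


# Toy payloads: the third carries the same identifier, but shifted.
packets = [b"GET /index", b"GET /about", b"X-Pad: 1 GET /"]
fraction_at_zero = match_fraction(packets, 0, b"GET")
```

Here the (0,3)-gram `GET` matches only the first two payloads. The third contains the same identifier at offset 9, so a fixed-offset test misses it, which is the kind of semantic split the paragraph above describes as hard to express.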


The extent to which these inherent limitations 
will affect NetADHICT’s usefulness is not clear. To 
this point we have not investigated NetADHICT’s be- 
havior beyond our own laboratory’s network. We have 
determined that NetADHICT can often segregate en- 
crypted and multimedia traffic, by only using their 
packet headers. If traffic moves to more evasive strate- 
gies such as overloading common protocol ports, play- 
ing with the other header fields, and using encrypted 
payloads, NetADHICT may be less able to discover 
patterns in network traffic. 


Evaluation of NetADHICT is difficult because it 
uses full packet payloads. Using full payloads intro- 
duces privacy concerns. This has not been an issue for 
our internal lab, but we could not investigate Net- 
ADHICT’s behavior on other, larger, networks be- 
cause of our inability to correlate the generated trees 
with the raw pcap traces. 


Our limited ability to evaluate NetADHICT has 
helped motivate us to release this tool. We can ad- 
vance our research and improve it with more adminis- 
trators testing NetADHICT on their own networks. 


The tool continues to improve, even without an 
external user base. We are improving the integrated 
traditional classifier. In addition, we would like to give 
users the ability to control the tree through manipulat- 
ing or locking nodes as it grows. This may improve 
an administrator’s ability to find, label, and track 
semantic classes of traffic. We are also investigating
different user interfaces and visualizations of the data. 


Finally, we are working on moving the clustering
and sampling portions of NetADHICT into the Linux
kernel. This would greatly improve efficiency and
increase throughput. A kernel implementation would also
allow us to schedule or filter at a per-cluster level,
so that NetADHICT could actively manage network
traffic to improve resource allocation and mitigate
malicious activity [15]. We see such extensions as
a fruitful area for future research.


Availability 


NetADHICT is licensed under the GNU General
Public License (GPL), version 2 or later. It is available
for download at the CCSL software website at
http://www.ccsl.carleton.ca/software .


Conclusion 


NetADHICT shows great promise in helping
administrators understand network behavior. It provides
a new way to separate normal traffic from the
“interesting” traffic that an administrator wants
to analyze. By offering new ways to visualize traffic and
a network “weather” map, NetADHICT allows
administrators to see the status of their whole
network at a glance, while also providing ways to
investigate smaller flows of traffic. While development
and testing are ongoing, by acting as a bridge from
higher level analysis tools to low level ones,
NetADHICT has the potential to improve how
administrators manage network resources.


ABSTRACT


Accurate performance testing of heterogeneous distributed systems, such as those created 
using GRID technology, requires a consistent method for retrieving system performance data from 
multiple platforms. This paper presents CAMP: a low-level, platform-independent performance
data API designed for use with distributed testing frameworks. 


CAMP is not necessarily tied to the distributed testing task: it provides a simple, low-level 
interface into operating system performance data that can be used to build complex performance 
measurement applications. This paper discusses CAMP’s functionality and implementation in 
detail. It also contains a detailed analysis of the API’s correctness, performance, and overhead. 


Introduction 


Performance testing is a critical part of the dis- 
tributed system development cycle, and there is a clear 
need for robust, automated, and reusable testing mech- 
anisms. This task is made difficult by several factors, 
and each must be individually addressed and resolved 
for a system to be generally applicable to an apprecia- 
ble portion of the set of distributed systems. 


Test beds often consist of heterogeneous systems 
composed of different platforms and capabilities. For 
purpose-built systems, this can be intentional. How- 
ever, for some system developers, this type of test bed 
may be the only resource available. For GRID-based 
systems, this is inevitable. 


With the increasing popularity of Java and other
virtual machine or interpreter driven languages,
application code is often made “incidentally portable”; that
is, the final system does not necessarily need
cross-platform capabilities, but it gains that ability through
the features of the host language. This provides a
unique opportunity for distributed system developers
to test code on expanded sets of test beds, including
the previously mentioned heterogeneous systems. For
developers to exploit this advantage, a performance
testing framework must be able to operate seamlessly
and transparently on several platforms, and it must not
impose a burden on developers who are developing
otherwise platform-independent code.


Aside from platform independence, a testing 
framework should be able to measure data that is both 
relevant and meaningful. Unfortunately, distributed 
metrics are difficult to generalize. For example, devel- 
opers of distributed agent systems might be concerned 
with end-to-end processing time of requests, while a 
developer of a distributed computation system might 
be interested in the efficient and total utilization of
available network resources. However, certain metrics 
do lend themselves to standardization: system perfor- 
mance counters on individual nodes. A platform-inde- 
pendent method of accessing performance data on 
individual nodes could serve to supplement or even 
build larger, domain-specific metrics. 


The source of system performance data varies. 
Some solutions make use of highly accurate hardware 
performance counters [21], which are dependent on 
the underlying architecture of the tested systems. In 
addition, the interfaces to these hardware systems usu- 
ally require extension of the host operating system, 
making the framework doubly dependent on both the 
operating system and the underlying architecture. While 
this provides very accurate data for specific systems on 
a system-wide level, it is difficult to extrapolate isolated 
data for a single process. 


At a higher level, modern operating systems pro- 
vide interfaces to performance data, both on the system- 
wide and per-process levels. These interfaces can take 
the form of a well-defined API, a high-level system of 
performance counting, or simply exposed kernel struc- 
tures that contain performance information. Operating 
system performance interfaces often make use of hard- 
ware performance counters as well, further strengthen- 
ing their accuracy. In addition, several operating system- 
dependent metrics are of use to the distributed system 
developer. These can include virtual memory usage, 
page fault counting, and network interface statistics. 


This paper introduces CAMP: a cross-platform 
low-level API for measuring system performance, de- 
signed for use with distributed testing frameworks. 
CAMP is grounded on a single proposition: modern 
operating systems function similarly, and they keep 
track of their performance data in some externally- 
accessible way. CAMP standardizes access to this data 
through a cross-platform interface, fully encapsulating 
the available operating system performance reporting 
services. It is implemented across three major
platforms in Python, a platform-independent interpreted
language. The API and its respective semantics are
identical across each platform. While not a testing
framework in itself, CAMP makes a valuable contribution
to the task by solving the system data independence
problem at the lowest level.


CAMP provides a common entry point and re- 
trieval format for system-wide and per-process perfor- 
mance data. It is implemented completely statically, 
holding no persistent state. It attempts to provide raw, 
unprocessed data wherever possible. The implementa- 
tion is substantial, production-ready, and uses the high- 
est-performing, lowest-overhead method available on 
each platform to provide data. CAMP provides novel 
solutions to several implementation-specific problems. 
These include addressing a process consistently across 
multiple operating systems, encapsulating time-aver- 
aged data in a single, stateless function call, and 
addressing locally-named devices in a platform inde- 
pendent manner. 


Possible Usage Scenarios 


CAMP is a versatile interface: it aims to be the 
foundation of a performance measurement system and 
can be used as is or easily extended. It is usable in a 
variety of different scenarios. 


Derived Functions 


The data provided by CAMP is raw in nature; that 
is, it has no meaningful calculation performed on it. 
For example, network connections are measured by 
total numbers of packets sent and received rather than 
a rate. CAMP’s ability to provide this low-level data 
across multiple platforms allows developers to create 
platform independent secondary functions. 


For example, a developer needing rate or through- 
put data for network connections is free to implement 
the function in any way: he or she can choose how
the state is stored, what statistical functions will be 
used to determine the rate, and how often the system 
will be polled. The developer is guaranteed reliable 
and consistent raw data from the CAMP interface. This 
extension would not be tied to a specific platform. By 
selectively determining what extra processing and state 
are used, the developer minimizes the overhead of the
performance measuring system. A sample rate func- 
tion appears in Figure 1. 
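
Where the caller cannot block for a sample interval, the state could instead be kept between calls. The following is a minimal sketch of such a stateful extension, not part of CAMP: the class name `RateMeter` is hypothetical, and `read_counter` stands for any zero-argument callable that returns a raw, monotonically increasing counter (for example, a wrapper around a CAMP network function).

```python
import time


class RateMeter:
    """Turn a raw, ever-increasing counter into a rate without sleeping."""

    def __init__(self, read_counter, clock=time.monotonic):
        self._read = read_counter
        self._clock = clock
        # Remember the starting point so the first rate() call has a baseline.
        self._last_value = read_counter()
        self._last_time = clock()

    def rate(self):
        """Counter units per second since the previous call (or construction)."""
        value, now = self._read(), self._clock()
        elapsed = now - self._last_time
        delta = value - self._last_value
        self._last_value, self._last_time = value, now
        return delta / elapsed if elapsed > 0 else 0.0
```

The polling frequency is then entirely the caller's choice: each call to `rate()` reports the average over whatever time has passed since the previous call, at the cost of holding two values of state per counter.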


System Monitoring 


CAMP could also be used in a wide scale client- 
server monitoring system. The API’s versatility and 
consistency would allow implementation of a variety 
of cross-platform monitoring systems. System analysts 
might be interested in passive performance logging 
systems, where CAMP’s data could be used to make 
decisions about expanding capacity or more efficient 
allocation of resources. System administrators could 
find derived functionality like active throughput and 
capacity meters useful: spikes or drops in measured 
data could quickly indicate problems. 


Using CAMP, all of these systems could be built
with little knowledge of the inner workings of the
tested systems’ kernels. The simple model and
implementation in a clear, scripting-like language allow
non-developers to intuitively collect desired performance
data.


Distributed System Testing 


Accurate operating system performance data 
would be a valuable addition to an existing distributed 
testing framework, and CAMP would be able to pro- 
vide this functionality with minimal overhead. Many 
existing testing frameworks rely on the distribution of 
timestamped, domain-specific performance data [7, 
11, 18]. In this scenario, timestamped data from CAMP 
or CAMP-derived functions could be collected along
with the existing data and correlated by time. This 
could provide powerful insight for the diagnosis of 
performance problems observed on a global level: 
developers could precisely quantify their software’s 
effect on various operating system services during 
periods of poor performance. 
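
One way such a correlation might be done is to pair every application-level sample with the system sample nearest to it in time. The sketch below is an illustration under stated assumptions, not part of CAMP or of any cited framework: the function name and the `max_gap` parameter are hypothetical, both series are assumed to be sorted by timestamp, and the clocks of the nodes involved are assumed to be synchronized closely enough for the comparison to be meaningful.

```python
import bisect


def correlate_by_time(app_samples, system_samples, max_gap):
    """Pair each application sample with the nearest-in-time system sample.

    Both inputs are lists of (timestamp, value) sorted by timestamp.
    Pairs whose timestamps differ by more than max_gap seconds are dropped.
    """
    sys_times = [t for t, _ in system_samples]
    pairs = []
    for t, app_value in app_samples:
        i = bisect.bisect_left(sys_times, t)
        # Candidates: the system samples just before and just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sys_times)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(sys_times[j] - t))
        if abs(sys_times[best] - t) <= max_gap:
            pairs.append((t, app_value, system_samples[best][1]))
    return pairs
```

Each resulting triple holds the timestamp, the domain-specific measurement, and the system counter value observed closest to it, which is the raw material for asking what the operating system was doing during a slow request.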


Other systems concern themselves with treating
distributed systems as black boxes [1]. These systems
test only externally measurable values like global
latency and throughput of provided services. CAMP
provides a low overhead, non-intrusive way of
supplementing this domain-specific data with system
performance data. The API allows developers to maintain this
black box model by utilizing the available operating
systems to probe processes externally rather than
instrumenting the running code.


    import time

    # Seconds between the two samples (value assumed for illustration).
    SAMPLE_INTERVAL = 1.0

    # Calculate the current transfer rate (both
    # in and out) for the given network adapter.
    # Assumes the CAMP functions are already in scope.
    def BulkByteTransferRate(adapter):

        # Query CAMP for the initial state.
        initial = GetNetBytesSent(adapter)
        initial += GetNetBytesRecvd(adapter)

        time.sleep(SAMPLE_INTERVAL)

        # Query CAMP for the final state.
        final = GetNetBytesSent(adapter)
        final += GetNetBytesRecvd(adapter)

        return (final - initial) / SAMPLE_INTERVAL

Figure 1: A derived function that measures network throughput.


Design and Architecture 


The architecture of a performance test can vary 
across different project domains, and it can evolve as a 
single project grows over time. Because of this, the 
architecture must be flexible; that is, it must be appli- 
cable to any of the aforementioned scenarios. To 
accomplish this, the interface must be created at the 
lowest level possible. Figure 2 shows the CAMP API in 
relation to the surrounding implementation levels. 


CAMP’s design uses a simple, low-level stateless 
interface. To the user, all performance data can be 
thought of as existing in a set of permanent counters, 
and the API merely provides simple accessors to the 
data. The API is a set of solid, well-tested building 
blocks that form the foundation of a platform-indepen- 
dent performance monitoring application. 


The design is largely inspired by declarative
languages like SQL. The data is assumed to exist in a
changing but directly inaccessible state, and the
methods for accessing the data are thread-safe and
consistent for a single point in time. CAMP focuses on
providing the simplest, most intuitive interface without
prematurely optimizing it by forcing it into the mold
that best suits the performance data access patterns of
a particular platform. As discussed in our empirical
evaluation, this does take a toll on performance in some
cases. The simple, atomic request-response architecture
allows a focus on correctness and usability: CAMP
defines distinct function contracts, and its proper use is
not defined by a set of unnecessarily complex access
patterns.


    Performance measuring applications, testing frameworks
    Possible CAMP extensions: derived functions, stateful counters
    CAMP: low-level, stateless performance functions
    Win32 (pdh) | Solaris (kstat)

Figure 2: The CAMP system architecture.


Global Functions

GetPercentProcessorTime(cpu)        Global CPU usage. CPU parameter may be omitted.
GetFreePhysicalMemory()             Global free physical memory

Network Functions

GetNetBytesSent(interface)          Total bytes sent on interface
GetNetPacketsSent(interface)        Total packets sent on interface
GetNetBytesRecvd(interface)         Total bytes received on interface
GetNetPacketsRecvd(interface)       Total packets received on interface

Disk IO

GetNumReads_Disk(disk)              Number of read operations on the given disk
GetNumWrites_Disk(disk)             Number of write operations on the given disk
GetNumReads_Partition(partition)    Number of read operations on the given partition
GetNumWrites_Partition(partition)   Number of write operations on the given partition

Per-process Functions

GetNumPageFaults(process)           Number of major and minor page faults
GetCPUTime_Total(process)           Process CPU utilization
GetCPUTime_User(process)            Process user mode CPU utilization
GetCPUTime_Kernel(process)          Process privileged mode CPU utilization
GetWorkingSetKb(process)            Size of the process working set in KB
GetVMSizeKb(process)                Size of the used virtual address space in KB
GetThreadCount(process)             Number of threads contained in this process

Enumeration Functions

EnumDiskPartitions()                Enumerates the available disk partitions
EnumPhysicalDisks()                 Enumerates the available physical disks
EnumNetworkInterfaces()             Enumerates the valid inputs to the network functions
GetCPUCount()                       Returns the number of CPUs in the system
GetProcessIdentifier(pid)           Returns a “process identifier” for the given pid
GetProcessIdentifiers(name)         Returns a “process identifier” for each running
                                    process launched from an executable of the given
                                    name

All CPU-time functions have an optional second parameter: the sample interval.

Figure 3: The current API.


Cross-Platform Coverage 


For CAMP to be complete, it should be able to pro- 
vide substantial coverage of each implemented plat- 
form’s performance statistics. This presents several 
challenges: some platforms provide more information 
than others, while some platforms simply provide dif- 
ferent information or different levels of granularity. 


CAMP makes use of a lowest common denomina- 
tor design. This approach, described in [5], is taken by 
the Java Abstract Windowing Toolkit (AWT) [22]. 
The Java AWT takes the intersection of natively-avail- 
able GUI components on several platforms and gener- 
alizes them under a single interface. While effective, 
this method does limit the functionality to that of the 
least functional platform. However, it is guaranteed to 
be fully consistent in any implementation. 
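
The intersection idea can be sketched in a few lines. The platform and metric names below are invented for illustration and do not describe what any real platform exposes; the point is only that the portable API is whatever survives in every per-platform set.

```python
def common_api(platform_metrics):
    """Return the metrics available on every platform (their intersection)."""
    metric_sets = [set(metrics) for metrics in platform_metrics.values()]
    return set.intersection(*metric_sets) if metric_sets else set()


# Hypothetical inventories of what each platform can report.
available = {
    "platform_a": ["cpu_time", "page_faults", "net_bytes", "handle_count"],
    "platform_b": ["cpu_time", "page_faults", "net_bytes", "load_average"],
    "platform_c": ["cpu_time", "page_faults", "net_bytes"],
}
portable = common_api(available)
```

Under these made-up inventories, `handle_count` and `load_average` drop out because at least one platform lacks them, which is the functionality cost of the design; everything left in `portable` can be offered with identical semantics everywhere.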


While this intersection design has the potential to 
severely limit functionality, CAMP is still able to pro- 
vide a comprehensive API. As discussed previously, 
CAMP is based on the postulation that modern operat- 
ing systems function similarly and record similar data. 
While this data may be difficult to access, the common 
set of accessible data was more than enough to provide 
a usable API. The current API appears in Figure 3. 


Performance 


CAMP is designed to be “as fast as possible,” 
and this involves the elimination of some design alter- 
natives — namely, the ability to leverage preexisting 
native utilities to shorten CAMP’s development time. 
Most of CAMP’s functionality can be collected from
the output of a native, command line driven utility on 
the host. However, this is a potentially costly layer of 
indirection: the forking of a process to satisfy a single 
function call, which may be called several times per 
second, puts an unnecessary strain on operating sys- 
tem resources and is very time consuming. 


This overhead may be acceptable for simple sys- 
tem monitoring tasks, as demonstrated by Eddie [12], 
but this level of resource consumption is not satisfac- 
tory for a high-performance API that intends to form a 
foundation for performance measuring applications. 
Instead, CAMP makes use of the fastest, most direct 
native interfaces into the kernel performance data for 
each implemented platform. It does not make use of 
any other utilities or services. 


Implementation 


CAMP’s public interface is implemented in the 
Python language, and its internals consist of a mix of 
Python and native code. It currently supports the 
Win32 (Windows NT/2000/XP), Linux 2.6, and Solaris 
(SunOS kernel 5.x) platforms. The Win32 implementation
interfaces to Microsoft’s Performance Data Helper
(PDH) interface. In the Linux implementation, CAMP
interfaces with the proc filesystem. On Solaris, CAMP 
uses both the proc filesystem and the kernel statistics 
chain (kstat). 


This section discusses each platform’s implementation
in detail, and it also chronicles several
global issues that affected all platforms.


Python 


Python is an open source, interpreted language
that is available for most modern platforms. It is
characterized by its unusually clear syntax and strong set
of libraries. Python lacks end-of-statement delimiters,
forced declaration of variables, and explicit types. It
also uses the concept of “significant white space,”
with indentation actually indicating the nesting level
of a particular statement. These features contribute to
Python code’s readability. Python combines the simple
syntax and excellent text processing capabilities of
a scripting language with the robustness and maturity
of a full, heavyweight programming language.

CAMP uses Python for a number of reasons.
Python has full access to the Win32 API through
the pywin32 package. This includes a
wrapper around Microsoft’s Performance Data Helper
interface, which was able to provide nearly all
performance measuring functionality in the Win32
implementation. Python is also ideal for text processing
tasks. With its full regular expression matching
and replacement implementation, Python lessened the
difficulty of parsing the cryptic output of the proc
filesystem.


Windows 


Microsoft does provide a low-level interface to 
performance information: the HKEY_PERFORMANCE_ 
DATA registry branch. However, it is not visible to the 
user and requires direct programmatic interaction with 
the registry hive. When using the registry directly, 
access is not error-checked or thread safe. Incorrect 
use may provide false data rather than an exception or 
error. Instead of accessing this data directly, Microsoft 
recommends the use of the Performance Data Helper
interface.


The Win32 Performance Data Helper interface is
a high-level encapsulation around the concept of
performance data gathering. It revolves around the
concept of a “counter,” which is an object that is attached
to a particular performance “concept” and can be called
to periodically gather data about that concept. To
retrieve data, the user calls either a raw or formatted
data retrieval function on the counter. The Windows
“Performance” control panel administrative tool
demonstrates the direct use of this interface.


Windows provides counters for performance “objects.”
Examples of objects include “Memory,” “Processor,”
“Paging File,” and “Physical Disk.” Each
object is associated with one or more “counters.” For
example, the “Memory” performance object contains
around twenty counters, including instances such as
“Pages/sec,” “Available Bytes,” and “% Committed
Bytes in Use.”


Some counters are associated with “formatted” 
data, which involves a computation on raw data. 
“Pages/sec” is an example of one of these: the counter
itself contains a raw count of the number of pages writ- 
ten to and read from disk, but retrieving the counter 
value causes an average rate to be calculated. One can 
bypass this calculation by retrieving the “raw” perfor- 
mance data from a counter. Other, scalar counters like 
“Available Bytes” return the same value for both for- 
matted and raw outputs. CAMP almost exclusively uses 
the raw data, which may have its ultimate source in 
the NT kernel (in the case of process information) or 
in an individual device driver (in the case of network 
or disk IO). 


A counter can optionally be associated with an 
“instance.” For the counters in the object “Process,” 
the set of instances consists of the set of running processes.
For the counters in the object “Network Interface,” the set
of instances consists of the set of installed network 
adapters. Microsoft provides a function to enumerate 
the set of instances for a given counter, which CAMP 
uses to implement each of the enumeration functions. 


The Microsoft Performance Data Helper interface
is designed for higher-level usage than that
provided by CAMP, but there appears to be little overhead
in using its clear, relatively simple interface.


Linux 


On the Linux platform, CAMP collects operating 
system performance data through the proc pseudo- 
filesystem. proc is a file-like interface to kernel data 
structures that show system information, which
includes performance data. At the top level, the
filesystem contains three types of entries: status
files, kernel-specific directories, and process
directories3. The proc filesystem is referred to as a
pseudo-filesystem because the “files” presented are
actually file-like interfaces into kernel data
structures which reside completely in memory or are
generated dynamically. As a
result, access to the filesystem is exceptionally fast. 
Several files are redundant: human-readable versions 
are supplemented by simple, one-line white space 
delimited versions that can be parsed more quickly. 


The status files contain useful information about
the system configuration: information about the CPU,
mounted filesystems, attached devices, and memory.
They also include an accurate uptime count, which
lists the current system uptime and how much of that
time has been spent in the idle process.
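
As an illustration of how little parsing such a status file needs, the
following sketch reads the two values held in /proc/uptime. It is not
CAMP's code: the function names are hypothetical, and the sample string
is made up for the example.

```python
def parse_uptime(text):
    """Split the one-line contents of /proc/uptime into (uptime, idle) seconds."""
    uptime_s, idle_s = text.split()[:2]
    return float(uptime_s), float(idle_s)


def read_uptime(path="/proc/uptime"):
    # Linux-only: the file holds two whitespace-separated numbers.
    with open(path) as f:
        return parse_uptime(f.read())


print(parse_uptime("350735.47 234388.90\n"))  # (350735.47, 234388.9)
```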


The kernel-specific directories are groupings of
status files. For example, the net directory contains
information about each protocol in use as well as raw
performance data for each network adapter. In
addition, when granted root-level access, several of
these files are writable. Modifying one of these
“files” causes the modification of the related kernel
data structure.


3Note that this breaks from the traditional UNIX model
of only keeping process information in proc. This
greatly aided CAMP’s development on Linux.


CAMP uses these kernel-specific files for imple- 
menting all global functions. In many cases, the data 
provided by proc is already in a suitable format. In
some cases, however, the data requires manipulation. 
For example, most memory measurements in Linux 
are stored as page counts. CAMP converts these to a 
raw size by adjusting the value based on the results of 
the POSIX getpagesize call. 
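
The page-to-byte conversion described above can be sketched in a few
lines of Python. The helper name is hypothetical; the only library call
used is resource.getpagesize from the Python standard library, which is
available on UNIX-like systems.

```python
import resource


def pages_to_bytes(page_count, page_size=None):
    """Convert a page count to bytes, defaulting to the system page size."""
    if page_size is None:
        page_size = resource.getpagesize()  # Unix-only standard library call
    return page_count * page_size


print(pages_to_bytes(256, page_size=4096))  # 1048576
```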


The process directories contain useful informa- 
tion about each running process. They are named by 
the process identification number (pid) of the related 
process. A process directory contains a symbolic link 
to the executable that spawned the running process, 
information about memory and CPU usage, a list of 
open file descriptors, and varied information about the 
process’s environment. 


The information provided by proc fits well into the
CAMP interface. It is relatively low-level, accurate, 
and can be accessed with little overhead. 


Solaris 


Performance data on Solaris comes from two 
sources: the proc filesystem and the kernel statistics 
chain (kstat). The Solaris proc filesystem differs greatly 
from the Linux implementation, following a more stan- 
dard UNIX pattern. It exports performance data only 
for processes, not system-wide statistics, and the files 
contain binary data that must be read within a C-com- 
patible language and cast to the appropriate struct 
type. This effectively precluded direct access from 
Python. To solve this problem, CAMP uses its own C 
to Python shared library that handles all data retrieval, 
marshaling, and error checking. Once again, reads on 
the proc files are atomic and unbuffered to prevent 
data inconsistency. 


Global system data comes from a large data 
structure within the SunOS kernel called the kstat 
chain. The data structure consists of a linked list of 
performance counters, each of which can be one of 
several types: a set of name/value pairs, a binary C 
structure, or an undefined “raw” type. The data is cate- 
gorized hierarchically by “class,” “module,” instance 
number, and name, but this organization is not reflected 
structurally; that is, no matter what the search criteria, 
finding an appropriate kstat instance is always an O(n) 
operation. 
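
The cost of a lookup in such a structure can be seen in a simplified
model. The sketch below is not the SunOS data structure or the CAMP
library: the class, its field names, and the sample entries are
hypothetical, and the chain is reduced to a singly linked list of
Python objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KstatNode:
    """One entry in a simplified, hypothetical model of the kstat chain."""
    ks_class: str
    module: str
    instance: int
    name: str
    data: dict
    next: Optional["KstatNode"] = None


def lookup(head, module=None, instance=None, name=None, ks_class=None):
    """Walk the chain from head and return the first matching node, or None."""
    node = head
    while node is not None:  # every lookup is a linear scan: O(n)
        if ((module is None or node.module == module)
                and (instance is None or node.instance == instance)
                and (name is None or node.name == name)
                and (ks_class is None or node.ks_class == ks_class)):
            return node
        node = node.next
    return None


# Build a two-entry chain with made-up values.
tail = KstatNode("net", "e1000g", 0, "e1000g0", {"rbytes": 1024})
head = KstatNode("misc", "unix", 0, "system_misc", {"ncpus": 1}, next=tail)
print(lookup(head, module="e1000g", name="e1000g0").data["rbytes"])  # 1024
```

Because the class, module, instance, and name are only attributes of
each node, no combination of search criteria avoids the walk.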


A research-quality implementation of a Python
to kstat bridge already existed [2], but it was not
compatible with the current version of the SunOS
kernel and it was incapable of retrieving many of the
kstat instances that CAMP needed. Building on this
system, CAMP generalized the solution to all instances
and introduced compatibility with modern kernels.
CAMP’s version of the library also includes a faster
search algorithm: the original author made several
passes of the chain when one would suffice. The
library further includes several bug and data type
conversion fixes, and it adds the ability to produce
enumerations of all statistics structures of a certain
class or module.


With these enhancements, all global and enumeration
functions could be implemented with a relatively
simple interface. As in the Linux implementation,
memory data in page units had to be converted
to a byte count. 


Implementation Issues 


This section presents selected implementation 
issues and their respective solutions. 


Raw Data on Windows 


The pywin32 package that encapsulated the Windows
performance data handler interface was incomplete.
The “raw” retrieval function was not implemented in
the Python wrapper DLL, and its existence was not
acknowledged or documented.


CAMP includes an extended version of the 
pywin32 interface that includes this missing function- 
ality. CAMP uses Microsoft’s documentation and head- 
er files as a strict contract for converting the raw 
native values into appropriately typed Python objects, 
taking special care to avoid narrowing conversions 
that may cause loss of data. 


Determining CPU Usage 


Determining CPU usage from a single function 
call posed a unique problem. At any instant in time, the 
usage of a CPU in a time-shared operating system is 
either zero or one hundred percent. Introducing a persis- 
tent polling thread into the system was not a viable 
option; it would violate the design goals of the interface. 


To solve this problem, the CPU functions query 
the current idle and uptime counters immediately, block 
for a specified interval, and query a second time and 
return. This causes no additional performance over- 
head: the task can be unscheduled while blocked, and 
there is no thread-spawning overhead. The sample rate 
is configurable at call time by applying an optional sec- 
ond parameter to the call indicating the block interval. 
This value is set at 100 ms by default, and has proved to 
be very accurate while keeping the function responsive. 


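The utilization is calculated as:

$$\mathrm{utilization} = 1 - \frac{\mathrm{idle}_{\mathrm{final}} - \mathrm{idle}_{\mathrm{initial}}}{\mathrm{uptime}_{\mathrm{final}} - \mathrm{uptime}_{\mathrm{initial}}}$$
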
or, more simply, the percentage of time not spent in the 
idle process. The per-process CPU usage functions are 
implemented similarly but with a finer granularity, as 
both kernel time and user time are reported separately. 
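
The query, block, and query again pattern can be sketched as follows.
This is an illustration under stated assumptions rather than CAMP's
implementation: the function name is hypothetical, and the platform
specific counter access is replaced by a caller-supplied read_counters
function so that the example stays self-contained.

```python
import time


def cpu_utilization(read_counters, interval=0.1):
    """Return the fraction of time not spent idle over one blocking interval.

    read_counters is a caller-supplied function returning (idle, uptime)
    in the same time unit; interval is the blocking time in seconds.
    """
    idle0, up0 = read_counters()
    time.sleep(interval)          # the caller blocks; no polling thread is needed
    idle1, up1 = read_counters()
    elapsed = up1 - up0
    if elapsed <= 0:
        return 0.0                # counters did not advance; nothing to report
    return 1.0 - (idle1 - idle0) / elapsed


# Hypothetical counter readings: 8 s elapsed, 2 s of it spent idle.
samples = iter([(40.0, 100.0), (42.0, 108.0)])
print(cpu_utilization(lambda: next(samples), interval=0))  # 0.75
```

The default interval of 0.1 seconds mirrors the 100 ms default block
interval described above.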


Another issue arose with multi-CPU systems.
CAMP should provide the ability to distinguish load on
a single CPU from global system load, so the global
CPU measuring function allows an optional CPU
parameter: a 0-based index that requests the load for
a specific CPU. Omission of the parameter causes the
total system load to be returned. CAMP also provides a
CPU count function to provide information when the
count is not known.


Enumeration Functions 


The networking and disk functions must operate 
on a per-interface and per-disk or partition level to be 
useful. However, device names are certainly not con- 
sistent across systems, and indexes, while consistent, 
are arbitrary and not meaningful. 


One solution would be to use common device
names to reference the devices. The logical candidate
for this would be to map Linux’s standard eth*, hd/sd
naming scheme into Windows. However, this would
impose an unnecessary burden and layer of confusion
on Windows developers. In addition, the mapping of
English device names to a universal “code” could
prove to be nondeterministic, forcing developers to
use empirical tests to determine what “code”
referenced the relevant network adapter or disk.


CAMP’s solution is to use enumeration functions. 
In this particular context, the function enumerates the 
list of installed network adapters and the list of
installed disks. However, the semantics of the function
are stronger than they appear on the surface. Every output
of the enumeration function is guaranteed to be a valid 
input to the performance measurement functions; this 
means that the output will not only contain the correct 
name, but it will also be in the correct format required 
by the underlying platform. 


The Windows build of CAMP uses Microsoft’s 
PDH object enumeration function. On Linux, this was 
implemented by parsing a file in the proc filesystem 
and building a tuple of the results. On Solaris, this 
data is gathered over one traversal of the kstat chain. 
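
A Linux-style enumeration of this kind might look like the sketch
below. The text does not name the file that CAMP parses; /proc/net/dev
is assumed here, the function name is hypothetical, and the sample text
is abbreviated to a few columns for readability.

```python
def parse_interfaces(text):
    """Return a tuple of interface names from /proc/net/dev-style text."""
    names = []
    for line in text.splitlines()[2:]:      # the first two lines are headers
        name, sep, _counters = line.partition(":")
        if sep:
            names.append(name.strip())
    return tuple(names)


SAMPLE = """\
Inter-|   Receive                  |  Transmit
 face |bytes    packets errs drop  |bytes    packets errs drop
    lo:    1500      20    0    0      1500      20    0    0
  eth0:  740000   10000    0    0    740000   10000    0    0
"""
print(parse_interfaces(SAMPLE))  # ('lo', 'eth0')
```

Each returned name is taken verbatim from the file, so it can be passed
back to a per-interface lookup on the same platform without conversion.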


Process Addressing 


The standard method of addressing a process in
an operating system is the process identifier, or pid.
However, Windows does not have an O(1) method for 
programmatically accessing a process’s performance 
counter based on its pid; one must enumerate all pro- 
cesses and look for a match. This would be an unrea- 
sonable solution for CAMP: because the API does not 
maintain a state, each call to a per-process function 
would take O(n) on a Windows machine (where n is 
the number of currently running processes). 


On Linux and Solaris, the opposite is true. Get- 
ting process information by pid is achievable in O(1) 
time, but finding a set of pids from a name is an O(n) 
operation. 


CAMP’s solution to this is to introduce an abstract
concept: the “Process Identifier.” The developer can
retrieve a process identifier through a CAMP function
that takes a pid as its input. Because this function
ideally only runs once per session of use, the O(n)
running time is acceptable.


On the Windows side, this function is implemented
by enumerating all current processes and finding the
matching process. CAMP then creates a process
identifier, which on this platform is a string formed
according to Microsoft’s naming conventions. The
identifier consists of the name of the executable
binary with its extension truncated, and an increasing
numeral appended for processes spawned from the same
binary. CAMP returns an error value if the process is
not found.
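
The construction of such an identifier can be sketched in Python. This
is an assumption-laden illustration, not CAMP's code: the function name
is hypothetical, the process list stands in for a real enumeration, the
“#n” suffix format is assumed from the instance-index notation used in
Windows performance counter paths, and None stands in for the error
value.

```python
import os


def windows_process_identifier(processes, target_pid):
    """Build a PDH-style instance name for target_pid, or return None.

    processes is an ordered list of (pid, executable_name) pairs, standing
    in for the result of enumerating all running processes.
    """
    seen = {}                                   # base name -> count so far
    for pid, exe in processes:                  # O(n) scan over all processes
        base = os.path.splitext(exe)[0]         # drop the extension
        index = seen.get(base, 0)
        seen[base] = index + 1
        if pid == target_pid:
            return base if index == 0 else "%s#%d" % (base, index)
    return None                                 # process not found


procs = [(4, "System"), (812, "httpd.exe"), (900, "httpd.exe")]
print(windows_process_identifier(procs, 900))  # httpd#1
```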


The Linux and Solaris functions are much simpler:
they merely attempt to open the process’s status
file from its proc directory. If this fails for any
reason (the process does not exist or permissions are
lacking), an error value is returned.


CAMP does not attempt to hide the value of a 
process identifier in an abstract class. If a developer is 
aware of the identifier type for the current platform, he 
or she can bypass the process identifier step. 


For convenience, CAMP also provides a function 
that returns a list of process identifiers given an exe- 
cutable file name. This function returns a list because 
multiple processes can be running from the same exe- 
cutable — which may often be the case in a distributed 
system. For example, the prefork variant of the Apache 
web server creates multiple processes to handle child 
requests. A developer can simply run
GetProcessIdentifiers("httpd") to get an enumerable list of all running
Apache processes. 
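
On Linux, a name-based lookup of this kind amounts to a scan of the
process directories. The following sketch is not the CAMP function: the
name find_pids_by_name and the proc_root parameter are hypothetical,
and it relies on the Linux convention that the first line of a
process’s status file has the form “Name: <executable>”.

```python
import os


def find_pids_by_name(name, proc_root="/proc"):
    """Return the pids whose status file reports the given process name."""
    pids = []
    for entry in os.listdir(proc_root):         # O(n) in the number of entries
        if not entry.isdigit():
            continue                            # skip non-process entries
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                first = f.readline()            # e.g. "Name:\thttpd\n"
        except OSError:
            continue                            # process exited or access denied
        if first.split(":", 1)[-1].strip() == name:
            pids.append(int(entry))
    return sorted(pids)
```

A call such as find_pids_by_name("httpd") returns a list because, as
noted above, several processes may run from the same executable.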


Result Validation 


The objective of the result validation phase was
to ensure that each performance function exhibits
reasonable and predictable behavior under controlled
conditions. CAMP’s test plan involves
running the performance monitoring functions against 
small programs that use a specific resource in a mea- 
surable amount. 


Correct implementations of these programs would 
require kernel augmentation on all three platforms, as 
the operating system is inherently in control of the 
distribution of resources. In spite of this, our programs
were able to provide a reasonable approximation for 
CPU, memory, network, and thread usage. 


The test bed consisted of a high-end PC4 running
VMware Workstation. Each test ran sequentially on
three virtual machines: Microsoft Windows 2000
Professional (kernel 5.0), SUSE Linux 9.3 (kernel 2.6.11),
and Sun Solaris 10 (kernel 5.10). Each operating sys- 
tem ran only its essential services to minimize interfer- 
ence with tests. The virtual machine configuration 
limited each to 256 MB of RAM and a 32-bit virtual 
processor. 


CPU Time Verification 


To test the various processor time functions,
CAMP was set to monitor a simple Python program
that generates a fixed load on a single CPU. Put
simply, this program divides a discrete interval into
an “on time” and an “off time” based on the desired
load, and alternates between busy-waiting during the
“on time” and sleeping during the “off time.” The
code for this function appears in Figure 4.


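from time import perf_counter as clock, sleep  # time.clock() no longer exists

_INTERVAL = 0.1  # seconds per on/off cycle; assumed, not given in the paper


def waste(amount):
    """Generate roughly `amount` percent load on one CPU; never returns."""
    on_time = _INTERVAL * (amount / 100.0)
    off_time = _INTERVAL - on_time
    ctime = clock()
    while True:
        # Busy-wait through the "on time", then sleep for the "off time".
        if clock() - ctime >= on_time:
            sleep(off_time)
            ctime = clock()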


Figure 4: The CPU load generation code. 


The global CPU time function and the per- 
process equivalents were first individually tested against 
idle, 50% and full system loads. Next, both functions 
were tested against two processes, each consuming 
approximately 30% of the CPU. The optional “sample 
rate” parameter was omitted, leaving a default sam- 
pling period of 100 ms. 


4The host PC has a 64-bit Athlon processor, 2 GB of main
memory, and two 10,000 RPM SATA disks.


All reported values are averages of 200 samples
taken at a 0.5 second interval. Each average is the
center of a 95% confidence interval on the mean, given
that the overall sample distribution is normal. The
algorithm used is described in [10]. The range of each
confidence interval never exceeded 0.5% and does not
cause a visible difference on the scaled graphs.
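
For readers unfamiliar with the calculation, a generic normal
approximation of a 95% confidence interval on a mean is sketched below.
This is not necessarily the algorithm of [10]; the function name, the
z value of 1.96, and the sample data are assumptions made for the
illustration.

```python
import math
import statistics


def mean_confidence_interval(samples, z=1.96):
    """Return (mean, half_width) of a normal-approximation 95% interval."""
    n = len(samples)
    mean = statistics.fmean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)
    return mean, z * std_err


# Made-up utilization samples (percent); the paper averaged 200 per test.
mean, half = mean_confidence_interval([49.8, 50.1, 50.3, 49.9, 50.0, 49.9])
print("%.2f +/- %.2f" % (mean, half))
```

With 200 samples per test, the standard error shrinks with the square
root of the sample count, which is consistent with the narrow intervals
reported above.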


The results of these tests appear in Figure 5.
All platforms reported the idle and full loads without 
error, and each platform was able to provide accurate 
results for all other intermediate loads. The global 
measurement of the two-process test read slightly 
above the expected value; this is most likely due to 
scheduling overhead. 


Memory Usage Verification 


The memory usage tests operated on the func- 
tions GetWorkingSetKb and GetFreePhysicalMemory. The 
test program is a simple, two-stage Java program. It 
first launches and immediately allocates a 976 KB 
array on the heap and waits. Upon continuing, it allo- 
cates another 976 KB array and waits until terminated. 
Table 1 lists the results of the two CAMP functions 
before and after the allocation on all three platforms. 


CAMP reported consistent working set results for 
the program on all three platforms, albeit with a 4KB 
disparity on Linux. The Linux virtual machine’s page 
size is 4KB, so this represents a reasonable difference 
of just a single page. 

However, while the working set function reported 
consistent results, it did not report the 976 KB alloca- 
tion precisely. Because Java programs run in a virtual 
machine that handles allocations on behalf of the run- 
ning code, precise changes in memory usage, from a 
program perspective, are not necessarily exactly re- 
flected at the physical level. In practice, however, the 
difference between the expected and actual values was 
just over 1% when using these short-lived test pro- 
grams. This was sufficient for verification. 


The free physical memory validation was some- 
what more informal. A precise test for this function is 
impossible to execute in user space as the full working 
sets of all tested kernels seldom converge on a com- 
pletely steady state. This is due in part to each operat- 
ing system dynamically controlling paging and vari- 
ous caches through periodically-running background 
threads. Instead of a precise test, this measurement
was a verification that this function reasonably re- 
flected the expected changes in free physical pages. In 
all cases, CAMP’s free memory function reported a 
value within 10% of the expected result. 


Network Utilization Verification 


CAMP’s network functions report the cumulative 
bytes and packets sent and received on a single inter- 
face. To verify these values, the functions were run 
against a Java program that sent and received a fixed 
number of packets and terminated. Each test packet 
consisted of a plain, addressed UDP packet with a 32 
byte payload. UDP traffic was ideal for this test be- 
cause it allowed a precise prediction of the number of 
packets sent and received; there was no extraneous 
ACK or connection setup overhead to skew the results. 
Testing occurred on a private network consisting of 
only the CAMP-tested virtual machines and their host.


Table 2 shows the results of this test. All plat- 
forms reported identical packet counts, but Windows 
reported a different bytes received count. The offend- 
ing byte count was 600,000, which amounts to 60 
bytes per packet. All other byte counts were 740,000, 
or 74 bytes per packet, which gives a difference of 14 
bytes per packet. Several more tests with increased 
packet sizes showed that this discrepancy always 
remained at exactly 14 bytes per packet. 


The composition of each 74 byte packet is likely
14 bytes of Ethernet header, 28 bytes of IP and UDP
header overhead (20 and 8 bytes, respectively), and
the 32 bytes of payload. A likely explanation for the
14 byte difference is that the network
driver for the virtual machine’s network adapter is not 
adding the Ethernet header to its incoming byte 
counter. CAMP’s network functions reflect this dis- 
crepancy because the byte counts come directly from 
the respective network drivers. Informal tests on non- 
virtual Windows machines did not show this same 
asymmetry. 
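
The byte counts discussed above can be checked with simple arithmetic.
The sketch below assumes IPv4 without options and an Ethernet header
counted without its trailing checksum; the constant names are chosen
for this example only.

```python
ETHERNET_HEADER = 14   # bytes
IPV4_HEADER = 20       # bytes, assuming no IP options
UDP_HEADER = 8         # bytes
PAYLOAD = 32           # bytes, as in the test program
PACKETS = 10000        # packets sent in each direction

full_frame = ETHERNET_HEADER + IPV4_HEADER + UDP_HEADER + PAYLOAD
without_ethernet = full_frame - ETHERNET_HEADER

print(full_frame, full_frame * PACKETS)              # 74 740000
print(without_ethernet, without_ethernet * PACKETS)  # 60 600000
```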

Despite this difference, the key metrics for
evaluation were the packet count and the verification
that the byte counts were incremented by at least
320,000 bytes (payload × packet count). In addition,
the test programs verified that the payload was
delivered in full.


                Before Alloc.  After Alloc.  Difference
Work. Set Size           9132         10120         988
Phys. Free             416244        415272       (972)
(a) Windows

                Before Alloc.  After Alloc.  Difference
Work. Set Size          14128         15112         984
Phys. Free              10196          9324       (872)
(b) Linux

                Before Alloc.  After Alloc.  Difference
Work. Set Size          13096         14084         988
Phys. Free             387440        386472       (968)
(c) Solaris


Table 1: Memory allocation results.


Thread Count Correctness 


Verifying the thread count function was a straight- 
forward task. The function GetThreadCount recorded 
the current thread count of a Java program running 
one extra thread. After a fixed amount of time, the 
program spawned a second thread. The thread count 
was recorded once again. The results of this test 
appear in Table 3. CAMP recorded the proper increase 
in thread count on each platform. 






          Before launch  After launch  Difference
Windows               9            10           1
Linux                 9            10           1
Solaris               9            10           1


Table 3: Thread count test results. 


This test could potentially have been invalid.
The Java Language Specification does not guarantee 
that each conceptual thread is backed with a native 
thread. However, in practice, the Sun-provided Hot- 
Spot virtual machine implementation behaves in this 
manner on all three platforms. Surprisingly, each plat- 
form showed the exact same thread count — which 
would suggest that the virtual machines are struc- 
turally similar. 


Informal Verification 


The previous sections contain strictly quantified
test results that confirmed CAMP’s functionality.
However, they were not comprehensive: they did not
cover the enumeration functions, the disk IO
functions, or the page fault count. Due to the difficulty in
generating predictable results, these functions were 
part of an informal test plan. 


The disk IO functions are difficult to test
consistently. The disk IO data provided by CAMP comes
directly from the respective disk drivers, so this data 
does not reflect an operating system-level layer of 
abstraction. There is no direct relationship between a 
user-level read or write and an actual physical disk 
action. For this reason, the disk functions were veri- 
fied informally. On all platforms, the counters were 
integral, increasing, and strictly monotonic. They also 
increased their respective rates of increase during peri- 
ods of high disk activity, making them suitable for a 
derived rate function. 
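
An informal check of this kind can be expressed in a few lines. The
helper names and the readings below are hypothetical; note that the
sketch tests only that the samples never decrease, which is a weaker
property than strict monotonicity.

```python
def is_non_decreasing(samples):
    """True if each counter sample is at least as large as the one before."""
    return all(b >= a for a, b in zip(samples, samples[1:]))


def deltas(samples):
    """Per-interval increases between consecutive counter samples."""
    return [b - a for a, b in zip(samples, samples[1:])]


# Made-up cumulative byte counts sampled at a fixed interval.
readings = [1000, 1000, 1400, 5400, 9800, 9900]
print(is_non_decreasing(readings))  # True
print(deltas(readings))             # [0, 400, 4000, 4400, 100]
```

The larger deltas in the middle of the series correspond to the kind of
increase one would expect during a period of high disk activity.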


The page fault function was also difficult to test. 
Instead of a direct analysis, CAMP’s output was veri- 
fied to match that of a comparable native utility on 
each platform. 


On Windows, Microsoft provides an optional util- 
ity in the Windows Resource Kit called pstat, which 
provides a crude version of UNIX ps. This utility was 
able to provide a page fault count for comparison. On 
Linux, the standard ps command is able to provide a 
page fault count with the switch -O min_flt,maj_flt. Sun
does not provide a utility with Solaris 10 to access the 
page fault count of a single process. However, a compa- 
rable utility, pio [9], was able to provide the needed data 
to verify CAMP’s function. 


The values returned by the enumeration func- 
tions are by definition platform and system-dependent, 
making them impossible to test quantitatively. Instead, 
the functions were tested informally on an expanded 
test bed, which included a 64-bit Linux production 
server, two quad-processor 64-bit Solaris servers, a 
Windows XP x64 desktop, and the three original vir- 
tual machines. 


                  Before Send  After Send  Difference
Packets Sent            37412       47412       10000
Bytes Sent            2175911     2915911      740000
Packets Received        58299       68299       10000
Bytes Received       13001375    13601375      600000
(a) Windows

                  Before Send  After Send  Difference
Packets Sent            60968       70968       10000
Bytes Sent            4157951     4897951      740000
Packets Received       112454      122454       10000
Bytes Received      154652907   155392907      740000
(b) Linux

                  Before Send  After Send  Difference
Packets Sent             1762       11762       10000
Bytes Sent             160145      900145      740000
Packets Received         3379       13379       10000
Bytes Received         386700     1126700      740000
(c) Solaris


Table 2: Network test results.


Testing the enumeration functions involved
manually collecting a list of required results in a
platform-dependent manner. On all six test systems,
the CPU count was known beforehand and simple to
verify. On Windows, correct values for the network,
disk, and partition enumeration functions are
available in the “Device Manager” administrative tool.


On Solaris and Linux, the command netstat -i pro- 
vided a correct list of network interfaces suitable for 
comparison. Disk and partition lists were available as 
symbolic links in the device node directories. This 
information was available in /dev/disk/by-id/ and /dev/dsk 
on Linux and Solaris, respectively. 


Figure 6: Execution time results. Functions are grouped
by similar implementations and averaged. From left 
to right: Windows, Linux, and Solaris. 


Performance Evaluation 


This section details the tests used to measure the 
API’s execution speed and overhead. The first set of 
tests precisely measured execution speed and CPU 
usage distribution throughout the implementation com- 
ponents. The second test measured the overhead of a 
simple CAMP-based monitoring program on a running 
system. The final test measured the impact of the same 
monitoring program on a production web server using 
externally-derived statistics. 


Timing and Profiling 


This section contains a low-level, direct measure- 
ment of CAMP’s performance overhead. Python pro- 
vides a timing package, timeit, which takes a small snip- 
pet of code as an input and aggregates several iterations 
of it into a single execution unit. timeit then runs that 
larger unit several times, timing it using the highest res- 
olution clock available on the host platform. 


Timing runs consisted of a total of 30,000 runs 
per function: three separate execution runs of 10,000 
invocations. To measure the overhead of the CPU 
functions in their purest form, the sample rate was set 
to zero to eliminate the normal blocking. 
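
A timing run of this shape can be reproduced with the standard timeit
module. In the sketch below, sample_function is a placeholder and not a
CAMP call; the repeat and number arguments mirror the three runs of
10,000 invocations described above.

```python
import timeit


def sample_function():
    # Placeholder for a monitoring call; not an actual CAMP function.
    return sum(range(10))


# Three separate runs of 10,000 invocations each: 30,000 calls in total.
runs = timeit.repeat(sample_function, repeat=3, number=10000)
per_call = min(runs) / 10000
print("best per-call time: %.3e s" % per_call)
```

Taking the minimum of the runs is one common way to reduce the
influence of unrelated system activity; the paper does not state how
its three runs were combined.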


For brevity, each “execution unit” comprised 
groups of similarly implemented functions rather than 
individual calls. For a function to be “similarly imple- 
mented,” it must access the same class of performance 
data. A trial run on all individual functions confirmed
that the grouped functions had near-identical execu- 
tion times. 


Figure 6 displays the results of these timing runs. 
All CAMP functions executed quickly, with the slowest 
function still allowing for over 100 invocations per 
second. 


The Linux functions all executed an order of
magnitude faster than their Solaris or Windows
counterparts. The Windows functions all displayed relatively
significant overhead, but the disk counters were sur- 
prisingly fast to access. The Solaris functions behaved 
as expected: The disk and network functions both exe- 
cuted in around the same amount of time as they both 
traverse the same chain for their target data, while the 
CPU functions must traverse the chain twice and per- 
form some calculations. Solaris per-process functions 
make use of the proc filesystem, so execution time is 
comparable to that of Linux. 


Figure 7: Relative execution time results.


To put these execution times in perspective, Fig- 
ure 7 contains the same information as Figure 6, but 
the graph has been scaled against the execution time 
of running a native process within Python. As dis- 
cussed previously, this is an approach taken by other 
utilities and was an implementation alternative for 
CAMP. On Linux, the timing includes forking a second 
process, executing the command ps -f, and collecting 
its output. On Solaris, the timing performed a similar 
action using the kstat command line utility. No such 
command line utility is built in to Windows.5 CAMP’s 
native access of performance data clearly allows a 
much higher level of performance. 


Measured Overhead, or CAMP on CAMP 
Next, we used CAMP to measure the overhead of 
a CAMP-derived monitoring program. 


5The aforementioned pstat tool from the Microsoft
Resource Kit is parameterless, which forces the full
enumeration of every process and all of its performance
attributes. Its execution time is on the order of seconds.


The first task involved a tunable monitoring
program, probe. This program calls nine CAMP functions
at a time, with the entire nine-function block being
executed at a specific real-time sample interval. To
implement the sampling frequency as accurately as
possible, this monitoring program times the execution
of the nine-function block at startup and adjusts the
sleep interval accordingly.
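
The self-calibrating loop can be sketched as follows. The source of
probe is not reproduced in the paper, so the function names, the
cycles parameter, and the stand-in block are all hypothetical; the
sketch only illustrates timing the block once at startup and
subtracting that cost from the sleep interval.

```python
import time


def calibrated_sleep(block, sample_interval):
    """Time one execution of block and return the sleep needed per cycle."""
    start = time.perf_counter()
    block()
    cost = time.perf_counter() - start
    return max(0.0, sample_interval - cost)   # never sleep a negative time


def probe(block, sample_interval, cycles):
    """Run block at roughly one execution per sample_interval seconds."""
    pause = calibrated_sleep(block, sample_interval)  # measured once, at startup
    for _ in range(cycles):
        block()
        time.sleep(pause)


calls = []
probe(lambda: calls.append(time.perf_counter()), sample_interval=0.01, cycles=3)
print(len(calls))  # 4: one calibration call plus three cycles
```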


The tests consisted of running probe at increasing 
sample rates and measuring the impact on the system 
at each level using another instance of probe. While the 
results of every CAMP function were included in the 
measurements, the only significant differences mani- 
fested themselves in the global and per-process CPU 
functions. However, this exercise did confirm that 
CAMP is indeed stateless and without memory or 
thread leaks on all implemented platforms. The results 
of this test on the three implemented platforms are
shown in Figure 8. CPU values are an average of 20 
samples taken over a 10 second period. 


These results were in line with the values re- 
corded by the timing functions. On Solaris and Win- 
dows, CAMP produces little overhead up to 10 func- 
tion blocks/second (90 functions/second). On Linux, 
CAMP produces little overhead at all measured sample 
rates. 


Impact on an Operational System 


The final performance test consisted of measuring 
CAMP’s overhead from an externally accessible statistic. 
This test is important for several reasons. First, the 
results of the previous test could have been partially 
skewed by using CAMP to test itself — caches could 
have been kept artificially hot. Second, the concept of 
“CPU utilization” is a somewhat coarse statistic that 
does not always directly map to actual or perceived
performance. Third, CAMP may cause performance
impacts in the kernels of the respective platforms in a
non-obvious way that may skew results. Lastly, it 
allows a complete distrust of the operating system- 
provided performance data. 


For this experiment, the target application was a 
web server. Each virtual machine was configured with 
Apache 2.0.53. The Linux and UNIX machines used 
the prefork multiprocessing module (MPM), and the 
Windows machine used the default Windows NT MPM. 
Each Apache installation used the default MPM configu- 
ration parameters, including initial process and thread 
count. 


The test involved running the probe program 
from the previous test at the same sample intervals, 
but instead of recording system performance data on 
the server, Hewlett Packard httperf [13] collected http 
client response statistics from an external machine. 


To generate a normal load, httperf created a total 
of 1400 connections to each web server at a rate of 40 
connections/second. This rate produced an approxi- 
mate 15% load on the Linux and Solaris machines and 
a 30% load on the Windows server. The results of this 
test are presented in Figure 9. 


CAMP only produced a measurable impact on the
Windows server. It appeared to take a progressively
increasing hit as probe’s sample rate increased. CAMP
did not affect the performance of the Solaris and
Linux servers at this sample rate. This was somewhat
unexpected, given that CAMP’s performance on Solaris
was comparable to its Windows performance. A likely
explanation lies in the implementation of the Apache
MPM on each platform. The Windows version of
Apache handles all client connections from within a
single, multi-hundred-threaded process. In terms of
process scheduling, this lone process has the same
weight and priority as probe. This causes CAMP’s
impact to be more pronounced, even at relatively low
sample rates. On Solaris and Linux, Apache prefork
responds to the increased load by forking additional
processes, which causes the Apache system as a whole
to be scheduled more often than probe.


Figure 9: Results of the operational system impact test.


Use Cases 


CAMPmon 


CAMPmon is a simple Python application that 
demonstrates the use of the CAMP API. It monitors a 
selected group of networked workstations, each CAMP- 
supported, and records the global CPU and memory 
usage to a server console application. 


The workstation client program is simple: it mere- 
ly loads Python’s lightweight XML-RPC server and 
registers the relevant CAMP functions as network- 
accessible. The server console program reads in an 
XML file with a list of accessible workstations and 
queries them at a specified quantum using the Python 
threading package. This data is then aggregated on the 
server console and displayed as formatted text. 
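
The workstation side of such a design can be sketched with the
XML-RPC server in the Python standard library (xmlrpc.server in
current Python versions). This is not the CAMPmon source: the two
monitoring functions are stand-ins that return fixed values, and all
names are hypothetical.

```python
from xmlrpc.server import SimpleXMLRPCServer


def get_cpu_usage():
    # Stand-in for a CAMP global CPU function; returns a fixed value here.
    return 12.5


def get_free_memory():
    # Stand-in for a CAMP memory function; returns a fixed value here.
    return 204800


def make_server(host="127.0.0.1", port=0):
    """Create an XML-RPC server exposing the monitoring functions."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(get_cpu_usage)
    server.register_function(get_free_memory)
    return server


server = make_server()            # port 0 lets the OS pick a free port
print(sorted(server.funcs))       # ['get_cpu_usage', 'get_free_memory']
server.server_close()
# A workstation client would call server.serve_forever() instead of closing.
```

Because the console sees only the registered function names, it needs
no knowledge of the platform on which each workstation runs.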


Testing was performed across eight workstations: 
four booted into Fedora Linux and four booted into 
Windows XP. The server monitoring application was 
launched from a Windows workstation. The system 
performed as expected, continuously and accurately
reporting each workstation’s performance information. 


This simple implementation is intended to be a 
proof-of-concept display of CAMP’s potential. Only a 
single client application was used for both platforms, 
and the server was unaware of the underlying imple- 
mentation on each workstation. 

A Two-Platform Evaluation of Apache 

Two identical servers were configured with Win- 
dows 2000 Server and SUSE Linux 9.3 (kernel ver- 
sion 2.6), respectively. Apache 2.0.53 was loaded on 
each. The Windows server was configured to use the 
default NT MPM with 300 threads, and the Linux 
server was configured with the worker multi-process- 
ing module (MPM), also configured with 300 initial 
threads. 

A Python script was written to monitor each 
server. At a half-second interval, it wrote a log file con- 
taining network traffic counts and aggregate process 
resource usage data about the collective Apache processes.
This script was run individually on both platforms.

A third server was configured with a build of 
Apache flood. Each server was tested with an identical 
configuration: 250 clients, connecting in groups of 10 
at one second intervals. Figure 10 contains the results 
collected by CAMP. 

Figure 10: Results of Apache evaluation.

The network functions graph shows that each
server was given a comparable load. Each server both 
sent and received the same amount of data. The CPU 
utilization graph was not unusual; the IO-bound Apache 
process did not heavily tax the CPU on either platform. 


The memory graphs, however, were more telling. 
The Linux build of Apache with the worker MPM was 
able to run in a much smaller footprint, and, more 
importantly, was able to respond to the increased load 
quickly and without a large increase in either its work- 
ing set or frequency of page faults. The Linux server’s 
working set rose a small amount at the instant the load 
test began and remained steady throughout, while the 
Windows server took nearly twenty seconds to reach a 
steady state. During those twenty seconds, the Win- 
dows server was constantly causing page faults. This 
performance discrepancy could have been caused by a 
combination of configuration differences and imple- 
mentation variations in the platform-dependent Win- 
dows and Linux MPMs. 


Related Work 


PAPI [3] is a system that shares several design 
goals with CAMP, but it accomplishes a slightly differ- 
ent task: it provides a platform-independent interface 
into hardware performance counters rather than operating
system counters. Most desktop processor architectures
can provide controllable metrics about processor and
low-level cache events. Each architecture has a different
set of assembly instructions for accessing these counters,
and each set has different semantics.


PAPI provides a platform-independent interface
to these counters through a standard C and FORTRAN 
interface, but it requires operating system kernel aug- 
mentation on most common platforms. PAPI does pro- 
vide accurate metrics: its hardware counter interface 
has been used to validate other benchmarks [15]. 


Unlike CAMP, performance data provided by PAPI 
is difficult to correlate with a specific process. Utilities 
that allow this behavior either ignore the effects of 
scheduling or instrument the operating system kernels 
to report scheduling events [14]. 


Many system monitoring solutions are imple- 
mented using the SNMP protocol [23]. It provides a 
simple request/response protocol for system monitor- 
ing and management. SNMP is an application level 
protocol, implemented using UDP, that allows clients 
to host a set of named data. Examples of managed data 
include host name, uptime, and load. Because the pro- 
tocol is at the application level, the local client imple- 
mentation (the SNMP agent) is in full control of what 
data is reported and how it is obtained. 


The latest version [17, 8] of the protocol pro- 
vides bulk request/response functionality that scales 
more reliably with large distributed systems. When 
used with common object names, SNMP is capable of 
providing a common interface into operating system
performance data by means of a common network 
packet. However, the SNMP agent that runs locally on 
the monitored machine and services requests must still 
be able to actually collect the system performance 
data, which in most cases must be accomplished with 
native code. CAMP can provide a simple method for 
implementing a platform-independent SNMP agent that 
provides performance data for a set of managed hosts. 


The Gloperf system [11] is part of the Globus 
GRID computing toolkit [6]. By using the GRID frame- 
work, it is able to measure selected performance statis- 
tics in a platform independent manner. The client side 
of this system runs as a daemon, and it actively collects 
its own statistics rather than broadcasting operating sys- 
tem or network protocol statistics. It uses a sensor/col- 
lection model in which the user installs a “sensor” on 
the client to be probed and periodically collects data. 
CAMP follows a completely different model: it only 
reports data that is already available on the host. 


The Tau [20] set of tools is a high performance 
computing testing framework that has been recently 
updated to collect full statistics from a Java Virtual 
Machine [19], introducing platform independence. This 
has been used to create a Java profiler that can selec- 
tively instrument and measure parallel and distributed 
Java applications. The tool’s feature set is comprehen- 
sive, and its standard interface allows it to be used on a 
number of platforms. However, all measurement is con- 
fined to within the virtual machine — Tau cannot mea- 
sure system-level statistics. 


CoMon [16] is a monitoring layer for the PlanetLab
testbed [4]. It collects data from individual PlanetLab
distributed nodes, and its concept of a “node-centric
daemon” can deliver nearly all of the performance
data that CAMP would provide. However, this tool
only works with the PlanetLab operating system (a
modified version of Fedora Core Linux), which makes
the research less useful for systems outside of that
controlled domain.


Future Work and Conclusions 


CAMP’s most important task is its continued imple- 
mentation on other Python-supported platforms. Many 
UNIX-like operating systems are able to support both 
Python and the performance information necessary to 
support CAMP, including the newest Macintosh operating 
system. 


There are endless possibilities for derived, sec- 
ond-layer functions. However, a standard set, includ- 
ing functions like rate calculators and aggregate net- 
work traffic, is feasible. 
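
As an illustration of what such second-layer functions might look like, the sketch below derives a rate from two samples of a cumulative counter and sums per-interface rates into an aggregate. The function names, the sample layout, and the 32-bit wrap handling are assumptions made for this example; they are not part of CAMP's published interface.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time, wrap=2**32):
    """Return the per-second rate between two samples of a cumulative counter.

    Assumes the counter only increases and wraps at `wrap` (hypothetical
    default: a 32-bit counter).
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter wrapped around between the two samples.
        delta += wrap
    return delta / elapsed


def aggregate_traffic(rates_by_interface):
    """Sum per-interface rates (e.g., bytes/second) into one aggregate."""
    return sum(rates_by_interface.values())
```

For example, two samples of 1000 and 4000 bytes taken two seconds apart give counter_rate(1000, 0.0, 4000, 2.0), or 1500 bytes per second.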


Distributed system performance testing is a broad 
problem, and CAMP seeks to solve the lowest level 
issue: CAMP allows developers to collect performance 
data from multiple platforms in a consistent, correct, and 
predictable manner. 
Software Availability


CAMP can be obtained at: http://wiki.csc.calpoly.edu/camp.

ABSTRACT


An operating system’s readahead and buffer-cache behaviors can significantly impact 
application performance; most often these improve performance, but occasionally they worsen it. To
avoid unintended I/O latencies, many database systems sidestep these OS features by minimizing 
or eliminating application file I/O. However, network traffic measurement applications are 
commonly built instead atop a high-performance file-based database: the Round Robin Database 
(RRD) Tool. While RRD is successful, experience has led the network operations community to 
believe that its scalability is limited to tens of thousands of, or perhaps one hundred thousand, 
RRD files on a single system, keeping it from being used to measure the largest managed networks 
today. We identify the bottleneck responsible for that experience and present two approaches to 
overcome it. 


In this paper, we provide a method and tools to expose the readahead and buffer-cache 
behaviors that are otherwise hidden from the user. We apply our method to a very large network 
traffic measurement system that experiences scalability problems and determine the performance 
bottleneck to be unnecessary disk reads, and page faults, due to the default readahead behavior. We 
develop both a simulation and an analytical model of the performance-limiting page fault rate for 
RRD file updates. We develop and evaluate two approaches that alleviate this problem: application 
advice to disable readahead and application-level caching. We demonstrate their effectiveness by 
configuring and operating the world’s largest¹ Multi-Router Traffic Grapher (MRTG), with
approximately 320,000 RRD files, and over half a million data points measured every five
minutes. Conservatively, our techniques approximately triple the capacity of very large MRTG and
other RRD-based measurement systems.


Introduction 


Sometimes common case optimizations by the 
operating system can adversely affect an application’s 
performance instead of improving it. For instance, in
most operating systems, readahead is intended to optimize
sequential file access by reading file content into the
buffer-cache with the expectation that it will soon be referenced. In this
paper, we identify and remedy a situation in which the 
performance of a popular time series database, the 
Round Robin Database (RRD) Tool, is adversely 
affected by the default OS readahead and caching 
behaviors. 


We present an investigative method to discover 
the RRD system’s performance bottleneck and an 
analysis of the bottleneck identified: the OS default 
file readahead and caching behavior. We describe two 
approaches to optimize system resource usage for 
maximum performance: (i) application advice to the 
OS to disable readahead and (ii) application-level 
caching. We validate our results by configuring and 
operating a Multi-Router Traffic Grapher (MRTG) system
that performs over half a million measurements
every five minutes and records them into a set of
320,000 RRD files in near real-time. We identify the
additional factors that limit further scalability of the
RRD system after these improvements. We also discuss
OS improvements to the readahead behavior that could
generally avoid the application performance problem
we observed.

¹Based on the authors’ experience in the MRTG and RRDTool
user community.


We investigate the scalability issues in a real-world
scenario. During the deployment of new network
equipment (routers and switches) over the past
few years at our university, the number of managed
devices grew significantly, nearly doubling each year.
This required our network measurement system’s 
capacity to scale similarly. Today, for approximately 
60,000 measured network interfaces, about 160,000 
RRD files need to be updated every five minutes. 
These record interface byte, packet, and error rates. As 
the number of measurement points grew with the net- 
work size, we found that an increasing number of 
measurements did not get recorded into the database 
within the required five minute interval (20 to 80 per- 
cent failures). 


We are motivated to study the scalability issues
of RRD for two main reasons: (i) we were confounded
by our system’s poor performance given that it is generously
sized with respect to processor, memory, and
disk, and (ii) any performance gains achieved would
benefit many, given the popularity of RRDTool. To the
first point, our prior understanding of RRD file structure
and access patterns led us to believe the amount
of work should not overwhelm our system. RRD files
are organized in such a way that a small number of
blocks are accessed per update cycle. The set of
blocks in a series of updates has a low entropy: that is,
most updates touch the same set of blocks. Thus, we
were of the opinion that the “working set” of blocks
for the RRD files in our system could reside completely
in the OS file buffer-cache. Unexpectedly, our
system’s CPU spent the majority of its time in an “I/O
wait” state due to disk reads.


To study the state of the buffer-cache, we wrote a 
tool called fincore that exposes the cache “footprint” 
of a given set of files; it takes a snapshot of the set of 
file blocks or pages in the buffer cache. This helps us 
determine at any given time what pages were brought
into memory by the OS and helps us discover the readahead
effects. We are also able to study the average
number of pages per file brought into the memory by 
an unmodified MRTG. This helps us determine the 
maximum number of RRD files the system could han- 
dle with fixed hardware resources. We wrote another 
tool called fadvise that can advise the operating system 
about the file access pattern using the posix_fadvise 
system call. This tool enables the user to forcibly evict 
any file’s pages from the buffer-cache, providing a key 
function in controlled experiments. 


Our work makes the following contributions:
• We provide two tools and a methodology to
  study buffering behavior. These enable a system
  administrator or analyst to study the buffer-cache
  of any system (provided it implements the
  requisite APIs) and draw conclusions about
  readahead behavior, cache eviction policies,
  and system capacity.
• We develop an analytical model and simulation
  that determine the number of RRD files that
  can be managed given fixed memory resources
  or to determine the memory required for managing
  a given number of RRD files.
• We present two optimizations to RRDTool and
  evaluate their performance and scalability. The
  first employs application-level buffering or caching
  to coalesce file updates. The second offers
  application advice to the operating system that
  RRD files are accessed randomly rather than
  sequentially, thus causing readahead to be disabled.


The remainder of this paper is organized as follows.
We first provide background on MRTG and RRD, and
introduce our network measurement system. Next, our
investigation technique is described in the “Method
and Tools” section. Then two complementary performance
optimizations to RRDTool are presented in
the “Application-Level Buffering” and “Application-Offered
Advice” sections. The
subsequent “Analysis” section contains our analytical
model and simulation details. Finally, in the “Scalability”
section, we report the scalability of the optimization
techniques by running what we suggest is the
world’s largest MRTG on a single server. Therein we
also discuss the factors limiting the further scalability
of RRDTool after these improvements. The “Related
Work” and “Discussion and Future Work” sections
follow and we close with our conclusions.


Overview of MRTG and RRD 


The Multi-Router Traffic Grapher (MRTG) is a
Perl script that collects network measurements and
stores them in RRD files. Figure 1 shows a simplified
MRTG in pseudo-code form. MRTG performance is
satisfactory as long as it can consistently complete one
loop iteration, consisting of one “poll targets” and one
“write targets” phase, in less than the update interval,
typically five minutes (300 seconds).


# read configuration file
## to learn targets
readConfiguration();
do {
    ## POLL TARGETS:
    ## collect values via SNMP:
    readTargets();
    ## WRITE TARGETS:
    ## update values in RRD files:
    foreach my $target (@targets) {
        RRDs::update(...);
    }
    sleep(...); # sleep balance of 300 seconds
} while (1); # forever

Figure 1: The MRTG daemon in pseudo-code.


MRTG refers to the configurable metrics it col- 
lects as Targets. Each target consists of two objects 
collected via SNMP, typically one inbound and one 
outbound measurement for a given network interface, 
i.e., a router or switch port. Thus, the number of tar- 
gets per network device is typically a function of its 
number of interfaces. 


The paired objects are each referred to as a Data
Source (DS) in a Round Robin Database (RRD). The
RRD file name itself and the file’s Data Sources
define the database “columns.” Round Robin Archives
(RRAs), or tables of values observed at points in time,
are the database “rows.” Figure 2 shows a typical
RRD file managed by MRTG.


RRD performance is influenced by the RRAs 
defined within a file. Each RRA has an associated con- 
solidation function, such as AVERAGE or MAX, that 
operates on a set of one or more Primary Data Points 
(PDPs), i.e., data points collected at the measurement 
interval. Additional RRAs typically require additional
work to be done periodically, such as every half
hour, every two hours, and once a day. These aggregation times
are defined as offsets from zero hours UTC. Thus all
like-configured MRTG RRD files require aggregations
to be done every half hour, more every two hours, and
then the most aggregations at midnight.
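
The clustering of aggregation work at common UTC boundaries can be illustrated with a small sketch. It assumes a 300-second update step and the three consolidation intervals named above; the names and the modulo test are a simplification for illustration and are not taken from RRDTool's implementation.

```python
# Seconds per consolidated row for the intervals named in the text.
# Epoch time is anchored at 00:00 UTC, so a modulo test yields the
# "offset from zero hours UTC" alignment.
RRA_INTERVALS = {"30 minute": 1800, "2 hour": 7200, "1 day": 86400}


def aggregations_due(epoch_seconds):
    """Return the RRAs whose consolidation boundary falls on this update."""
    return [
        name
        for name, interval in RRA_INTERVALS.items()
        if epoch_seconds % interval == 0
    ]
```

Under these assumptions an ordinary five-minute update triggers no aggregation, an update on a two-hour boundary triggers two, and the update at midnight UTC triggers all three, for every like-configured file at once.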


Figure 2: A typical MRTG RRD file and update operation.
This RRD file stores two Data Sources
(DSes) in eight Round Robin Archives (RRAs):
four AVERAGE and four MAX RRAs. The 5
minute AVERAGE and 5 minute MAX RRAs are
being updated.


Our MRTG System 


Our MRTG system uses RRDTool and currently
measures approximately 3,000 network devices. Pri- 
marily, the devices are switches and routers in our 
campus network including those in the core and distri- 
bution layers and most of the network equipment at 
the access layer, serving users in approximately 200 
campus buildings. 

In this work, we refer to this production MRTG
network measurement system as the System Under
Test (SUT). The SUT’s characteristics, including its
software, are summarized in Table 1.² The system’s
page size is 4KB and our file-systems are configured
with a 4KB block size. Thus, we will conveniently use
the terms “block” and “page” interchangeably when
referring to segments of a file whether they are on disk
or in memory.








Operating System | Linux 2.6.9
File System      | ext3 and ext2, 4KB blocksize
I/O Scheduler    |

Software         | Version
MRTG             | mrtg-2.10.5
RRDTool          | rrdtool-1.0.49

Table 1: Characteristics of the System Under Test and its software.






As our centrally managed network has grown, our
MRTG system has grown in terms of computing power 
and storage. One significant technique we employ to 
improve MRTG’s scalability is to divide the targets 
amongst a configurable number of MRTG daemons 
that we increase as our number of targets increases; we 
process about 10,000 targets per daemon. So, our one 
MRTG “instance” is actually a collection of MRTG 
daemons running on one server. Another dimension in 
which our MRTG system is larger than most is that we 
resize RRA 0 (the five minute averages) to store up to 
one year or five years of data. This increases an MRTG 
RRD file’s size from the typical 103 KB to 1.7 MB or 
8.2 MB, respectively, of course requiring much more 
disk space. (We see that this does not adversely affect
performance in the “Analysis” section.)


Prior to this work, our network growth exceeded 
the scalability of the SUT. The Appendix lists system 
and MRTG configuration recommendations that we’ve 
tested and used in our system to meet our performance 
goals. 


Method and Tools 
Examining System Activity 


We started by examining the SUT’s activity to 
determine the nature and extent of the performance 
problem. We present three measurements that led us to 
the root cause of the problem and that allow us to 
evaluate potential solutions. 


First, we measured the time to completion of
each measurement interval by each MRTG daemon on
our system, with approximately 160,000 targets in
total. As shown in Figure 1, this consists of two
phases, first polling the network statistics via SNMP
and then updating the pertinent RRD files. Figure 3 is
a scatter plot with the measurement’s time of day on
the horizontal axis and the seconds elapsed on the vertical
axis. The “poll time” dots show the time taken
(in seconds) to poll the network and the “total time”
dots signify the time taken by the poll and the subsequent
write phase in one loop iteration by an MRTG
daemon. Note that all the network polling finishes
well below 60 seconds (marked by a horizontal line).
The writing phases very often do not finish within the
period of 300 seconds (our five minute performance
goal, also marked with a horizontal line) for the daemons;
some even take 10 minutes to complete. This is
clearly unacceptable performance because it delays
the measurements for the subsequent poll phase in the
single-threaded MRTG daemon, resulting in missing
measurements.

²MRTG 2.10.5 patched as follows: Modified fork code to
use select as in mrtg-2.10.6. Added a --debug=time option to
report poll targets and write targets times. Removed test for
legacy “.log” files (log2rrd), and threshcheck.

Figure 3: Original MRTG performance on the SUT
with 160,000 targets processed by 28 MRTG daemons.
The total time for poll and write targets
phases often exceeds the five minute performance
goal.

Figure 4: Original CPU utilization on the SUT with
160,000 targets processed by 28 MRTG daemons.
CPU I/O wait state is excessive due to page faults
for RRD file content.

Figure 5: Original Disk utilization on the SUT with
160,000 targets processed by 28 MRTG daemons.
Block read operations unexpectedly exceed write
operations on RRD file updates.


Further examination reveals that the CPU was in 
I/O wait state for a majority of the time. Figure 4 
shows the CPU utilization. The user and system CPU 
utilization levels are ~20% and ~10%, respectively, 
and are not the bottleneck. However, the CPU is 
spending more than half its time in the I/O wait state. 
Thus, the CPU wastes most of the time waiting for I/O 
to complete. 


To understand the I/O wait, we studied the actual 
number of reads and writes involving the disk. Unex- 
pectedly, the system was doing close to 90,000 reads 
per second (see Figure 5). In contrast, the number of
writes was stable at ~12,000 writes per second. The 
high number of reads suggests that files are not being 
cached effectively. This led us to examine the contents 
of the buffer-cache to determine why. 


Examining Buffer-Cache Content 


The system’s buffer-cache content gives a good 
indication of which files’ accesses are benefiting from 
caching in core memory. Unfortunately, a system’s 
buffer-cache content is generally hidden from users 
and user processes. Prior work has resorted to timing 
file block accesses to surmise whether or not a given 
page already resides in the buffer-cache memory [4]. 
While suitable in some situations, this technique is 
indirect and has the unwanted side-effect of modifying 
the cache because it references the pages about which 
it inquires, causing them to be brought into cache and 
likely evicting other pages. Thus, a user tool to pas- 
sively investigate buffer-cache content is desired. 

We introduce a new user command called fincore
that is used to determine which segments of a file are
in core memory, presumably because they reside in the
buffer-cache. The fincore command takes file names as
arguments and displays information about file blocks
or pages in memory. The fincore command uses two
common system calls to accomplish this: mmap and
mincore. That is, it first maps a file into its process’
address space, then asks which pages of that segment
of the process’ address space are in core at that time.³

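
The mmap-then-mincore sequence just described can be sketched as follows. This is not the fincore source; it is a minimal Python illustration that calls the C library through ctypes, and it assumes a Linux system where mmap and mincore are available from the process's C library. The function name pages_in_core is hypothetical.

```python
import ctypes
import mmap
import os

# Assumption: Linux, with mmap/munmap/mincore exported by the C library.
libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                         ctypes.POINTER(ctypes.c_ubyte)]

MAP_FAILED = ctypes.c_void_p(-1).value


def pages_in_core(path):
    """Return one boolean per page of `path`: True if resident in memory."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    n_pages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    fd = os.open(path, os.O_RDONLY)
    try:
        # Map the file without touching its pages.
        addr = libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
        if addr is None or addr == MAP_FAILED:
            raise OSError(ctypes.get_errno(), "mmap failed")
        try:
            vec = (ctypes.c_ubyte * n_pages)()
            if libc.mincore(addr, size, vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
            # The low bit of each byte reports residency of one page.
            return [bool(b & 1) for b in vec]
        finally:
            libc.munmap(addr, size)
    finally:
        os.close(fd)
```

Calling sum(pages_in_core(path)) gives the number of cached pages for one file, the per-file footprint that the analysis relies on.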
Using fincore we were able to uncover the readahead 
effects on the buffer-cache. As an optimization, Linux 
reads pages ahead from the disk into the buffer, antici- 
pating locality of subsequent reads. This improves per- 
formance for most applications by decreasing subse- 
quent read latencies. With the current implementation 
of RRDTool, the readahead can have a highly adverse 
impact on performance and scalability. A brief discus- 
sion of the readahead algorithm within the context of 
the RRD file shown in Figure 2 can make this clearer. 


The readahead algorithm tries to guess whether 
the file being accessed is going to be read sequentially 
(when readahead is actually useful) or randomly. On 
an RRD file update, the first read is for the meta-data 
at file offset zero, i.e., the beginning of the file. The 
maximum readahead window size is 32 blocks for 
ext2 and ext3, and, on an initial read, the readahead 
window starts at half that maximum in anticipation of 
sequential access. So, 16 pages are read into the buffer-cache
when the application reads just one. The file offset
of the second block needed to update the AVERAGE
RRA depends on the current update position
within the RRA.


For the typical MRTG RRD file, this will lie
within the first 16 pages. The file offset of the third
block needed for the MAX RRA update sometimes
lies beyond the first 16 pages, which can lead to a further
8 pages being read into memory. (Eight pages are
read as the readahead algorithm reduces the readahead
window at the random seek into the MAX RRA.)
These file block accesses are depicted in Figure 6.


A typical RRD file update consists of two RRA
updates (AVERAGE and MAX). With the default
readahead, most blocks that are read in are unnecessary.
In the event of data kept over a longer period of
time, as in our case with five minute averages for one
year or five years, the write for the AVERAGE RRA
often lies well beyond the first 16 pages. The readahead
window is reduced to 8 pages for the next random
read for the AVERAGE RRA update and then to
4 for the subsequent random read for the MAX RRA
update. The readahead algorithm starts to adapt to the
random reads by reducing the readahead window. A
typical RRD file update requires just three block
updates, yet we end up bringing 28 (16+8+4) blocks
into the file cache. The file is then closed, which
causes the adapted readahead value to be lost, reverting
to 16 the next time the file is opened. For the
typical RRD file with 800 recorded values, we end up
bringing almost the full file into cache. If we could
bring just the required “hot” blocks into the file cache
by suppressing readahead from the beginning, we
would get better performance and scalability.

³fincore is not entirely passive; it likely affects the cache
slightly because it opens the file and thus causes an access to
its inode block.
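
The page counts quoted above follow from a simple halving rule, which the sketch below reproduces. It is a deliberately simplified model of the behavior described in the text (initial window at half the maximum, halved at each random seek that misses the cached pages), not the Linux readahead algorithm itself; the function name and parameters are illustrative.

```python
def pages_fetched_per_update(max_window=32, random_seeks=2):
    """Pages brought into cache by one open/update/close of an RRD file.

    Simplified model: the first read fetches half the maximum window,
    and each later random seek outside the cached pages fetches a
    window half the size of the previous one.
    """
    window = max_window // 2
    total = window
    for _ in range(random_seeks):
        window = max(1, window // 2)
        total += window
    return total
```

With the defaults this yields 16 + 8 + 4 = 28 pages for a long-history file, and 16 + 8 = 24 pages when only the MAX RRA read falls outside the first window, against the three pages the update actually needs.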


Figure 6: A sample RRD file update operation and the
blocks involved. Readahead causes many blocks
to be read unnecessarily, rather than just the
“hot” blocks.


We see that the number of blocks most often
required per MRTG RRD file per update is three.
Based on our fincore observations (in the “Analysis”
section), we note that there is also a low “churn” rate
of these blocks. That is, once fetched into memory,
these blocks were useful for a long period of time. If
we were caching only the required blocks, there would
be no waiting on reads, eliminating our performance
problem.


Due to the default readahead behavior, we must
wait for reads from the disk since we find almost nothing
useful in cache. For instance, for the file cache to
accommodate 300,000 RRD files at 24 pages each,
300,000 × 24 × 4 KB (close to 30 GB of memory)
would be required. Since the file cache isn’t that
large, the page replacement policy evicts the pages
that will be needed later. However, with suppressed
readahead, we would cache just three blocks per file:
300,000 × 3 × 4 KB (3.6 GB), so an order of magnitude
less memory is required to fit everything desired
in file cache. Actual buffer-cache behavior for RRD
files is not quite as simple as this example; see the
“Analysis” section for details.
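
The arithmetic behind these two estimates can be written out as a short sketch. The function name is illustrative, and decimal units (1 GB = 10^6 KB) are assumed because they reproduce the figures quoted in the text.

```python
def cache_needed_gb(files, pages_per_file, page_kb=4):
    """Buffer-cache needed to hold `pages_per_file` pages of every file.

    Uses decimal units (1 GB = 1,000,000 KB), an assumption chosen to
    match the figures in the text.
    """
    return files * pages_per_file * page_kb / 1e6


with_readahead = cache_needed_gb(300_000, 24)    # 28.8, "close to 30 GB"
without_readahead = cache_needed_gb(300_000, 3)  # 3.6
```

The ratio between the two cases is simply 24/3 = 8, which is the order-of-magnitude saving claimed for suppressed readahead.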


21st Large Installation System Administration Conference (LISA °07) 67 


Application Buffer-Cache Management for Performance... 


Evicting Buffer-Cache Content 


For repeatable experiments involving the buffer-cache,
we require fine-grained control over the buffer-cache
content. For instance, one run of an experiment,
such as an RRDTool update, brings pages of the RRD
file from the disk into the buffer-cache. If we wish to
see the effects of a subsequent update (reading the
pages again from the disk), we need to evict the pages
that were brought in earlier. Generally, the only methods
available to forcibly evict pages from the buffer-cache
were to either (i) unmount the file-system containing
the cached files or (ii) populate the cache with
hotter pages by accessing other content more frequently
or more recently, thus invoking the system’s
page replacement algorithm to evict the unwanted
pages. To perform controlled experiments we wanted a
more convenient method for a user to forcibly evict
specific files’ blocks from the buffer-cache. To do so,
we introduce a new user command called fadvise that
is used to provide file advisory information to the
operating system. The fadvise command takes file
names as arguments.


Our typical use of fadvise is to advise the system
that we “don’t need” a file’s blocks and that we’d like
them to be evicted from the buffer-cache.⁴ In this case,
the file is first synchronized so that its dirty pages are
transferred to the storage device, e.g., the disk, and
then the advice is issued. The fadvise command uses
the fsync then fadvise system calls to accomplish this.
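
The sync-then-advise sequence can be sketched in Python, whose os module exposes both calls on Unix systems. This is an illustration of the technique, not the fadvise tool itself; the function name evict_from_cache is hypothetical, and, as noted for Linux 2.6.9, whether pages are dropped immediately depends on the kernel.

```python
import os


def evict_from_cache(path):
    """Flush a file's dirty pages, then advise the OS they are not needed."""
    # Linux accepts fsync on a read-only descriptor; other systems may not.
    fd = os.open(path, os.O_RDONLY)
    try:
        # Dirty pages must reach the storage device before the advice
        # is issued, as described above.
        os.fsync(fd)
        # offset=0, length=0 means "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Paired with a residency check before and after, a call such as this gives the repeatable cold-cache starting point that the controlled experiments need.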


Application-Level Buffering 


We’ve described our system and shown that it
does not meet our performance goal. A number of
RRDTool users have proposed significant modifications
to RRD measurement systems to improve performance
by modifying I/O behavior [22, 12, 9]. These
proposals generally involve intercepting RRD file updates
and recording them to be written later. The updates
are thus deferred, then later coalesced and written.
The result is improved performance by the introduction
of an independent thread to perform application writes
and by the better locality characteristics of these periodic,
coalesced writes.

⁴The Linux 2.6.9 source code and experimentation show
that fadvise DONTNEED immediately evicts non-dirty pages
from the buffer-cache. Other implementations might not
evict the pages immediately.


Since this essentially implements a buffer-cache 
within the application, we call this technique applica- 
tion-level buffering. 


Technique 


We now describe our application-level buffering 
implementation called RRDCache, shown in Figure 7. 
RRDCache has three main components: 

1. The RRDCache Library: a perl module that 
handles an application’s calls to RRDTool’s 
library. Specifically, it is used in place of RRDs 
perl module and provides the same functions, 
i.e., update, graph, etc. 

2. The RRDCache Journal Buffer: a tmpfs file-system
[24] to which updates are temporarily
stored sequentially. (This is reminiscent of a
journal in a journaling file-system.) We selected
a memory-based file-system because it
reserves a portion of memory exclusively for
RRD and completely eliminates disk I/O during
the RRD update operation.⁵

3. The RRDCachewriter: a script scheduled hourly
using cron that periodically organizes updates
and applies them to the RRD files on disk. In
this way, RRDCachewriter performs both asynchronous
writes and a sort of I/O scheduling on
behalf of RRDCache and the application using
it. This side-stepping of the operating system’s
default behavior, such as flushing dirty buffers
to disk every five seconds, results in performance
gains.


⁵ The RRDCache journal buffer need not be a memory-based
file-system; it could be a disk-based file-system and
still yield improved performance due to the better locality
characteristics of appended writes to files.


Figure 7: An Overview of RRDCache. [Diagram labels:
“Once an hour, flush ramdisk cache to Disk Array”;
“RRDCachewriter”; “Disk Array”; “Client wants trending
data, doesn't need last hour”; “Client uses rrdtool normally”.]


Normally an MRTG daemon (or any application
that uses RRD files) accesses RRD files on disk
directly by using the RRDTool API. With RRDCache,
MRTG and other applications instead call functions in
the RRDCache library. RRDCache presents the same
API as RRDTool. So, when an MRTG daemon
calls the RRDCache::update function, the arguments
are appended to an RRDCache journal file associated
with the calling process, i.e., the MRTG daemon.


The RRDCache journal file is located on the 
tmpfs file-system, eliminating disk I/O for the update. 
Periodically (once every hour), the RRDCachewriter 
runs to process any new data that has been written to 
the RRDCache journal files. The RRDCachewriter 
handles the updates from the journal files by commit- 
ting the update to the appropriate RRD files on disk. 
In the process, it coalesces all the updates meant for a 
particular RRD file. If the requisite file pages are not 
already present in the buffer-cache, this has the benefit 
of bringing them into memory much less frequently. 
The RRDCachewriter can be run more often if the 
tmpfs runs out of space between runs. (We have been 
using 1 GB of our main memory for the tmpfs file-sys- 
tem and that has proven to be sufficient for the 
160,000 targets polled at five minute intervals.) 
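
The journal-then-coalesce flow described above can be sketched as follows. This is a minimal Python illustration, not the RRDCache implementation (which is written in perl): the function names, the tab-separated journal record format, and the apply_updates callback are all assumptions made for this example. In a real deployment the callback might hand all of a file's pending updates to RRDTool in a single call.

```python
import os
from collections import defaultdict


def journal_update(journal_path, rrd_path, timestamp, values):
    """Append one update to a per-process journal instead of touching the RRD file."""
    record = "%s\t%d:%s\n" % (rrd_path, timestamp, ":".join(str(v) for v in values))
    with open(journal_path, "a") as journal:
        journal.write(record)


def coalesce(journal_paths):
    """Group all journaled updates by RRD file, ordered by timestamp."""
    pending = defaultdict(list)
    for journal_path in journal_paths:
        with open(journal_path) as journal:
            for line in journal:
                rrd_path, update = line.rstrip("\n").split("\t", 1)
                pending[rrd_path].append(update)
    for updates in pending.values():
        updates.sort(key=lambda u: int(u.split(":", 1)[0]))
    return dict(pending)


def flush(journal_paths, apply_updates):
    """Apply coalesced updates file by file, then discard the journals."""
    for rrd_path, updates in sorted(coalesce(journal_paths).items()):
        apply_updates(rrd_path, updates)  # hypothetical callback, one call per RRD file
    for journal_path in journal_paths:
        os.remove(journal_path)
```

Because each RRD file is visited once per flush rather than once per update, its pages need to be brought into the buffer-cache far less often, which is the locality benefit the text describes.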


Performance Impact 


The performance of the measurement system 
using RRDCache is much improved. For our MRTG 
system, Figure 8 shows the results. We see that the 
MRTG daemons finish in well under 60 seconds 
which includes both the polling and writing. Contrast 
this with Figure 3, in which most of the updates were 
not achieving even the five minute performance goal. 


Since the RRD file updates were performed by
RRDCachewriter once an hour, and ordered by RRD
file, there is a limited amount of I/O wait by the CPU 
at the start of every hour (Figure 8). This I/O wait is 
much less than the original system’s I/O wait shown in 
Figure 4. The CPU utilization by the user and system 
processes remains the same as before. 


Also evident in Figure 8 are spikes in disk read 
and write activity once per hour as the updates are 
being transferred from the journal buffer to the RRD
files on disk. These disk I/O rates are much lower than 
the original system’s rates shown in Figure 5. 


One complication of RRDCache’s technique is 
that the application-level journal buffer is not readable 
by RRD applications other than the RRDCachewriter; 
currently it is just a buffer for writing, not a cache for 
reading. While updates reside in the RRDCache journal 
buffer, they can’t be directly accessed by applications
that may wish to graph recent measurements, for
instance. Thus RRDCache slightly changes near real- 
time access semantics for RRD files. To work around 
this, RRDCache provides a graph function that imme- 
diately flushes pending updates from the journal buff- 
er into RRD files upon attempts to read them and then 
returns the result of the RRDs::graph function. Appli- 
cations then have the option of accessing RRD files 
directly through the RRDs interface, thus reading per- 
haps only older data suitable for trend analysis. How- 
ever, performance would degrade if applications were 
to read every RRD file once per update interval (e.g., 
five minutes) to retrieve the most recent measurements,
reverting to approximately the poor performance
originally observed. If ever this becomes a
problem, RRDCache could be improved so that its journal
buffer is a true buffer-cache, consistent amongst
both reading and writing processes.⁶
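
The two read paths described above (flush-on-read for fresh data, direct reads for older data) can be illustrated with a toy in-memory model. This is only a sketch under assumed names: JournaledStore, read_fresh and read_direct are hypothetical, and plain dictionaries stand in for the tmpfs journal and the RRD files on disk.

```python
class JournaledStore:
    """Toy model of a write-only journal in front of slower storage."""

    def __init__(self):
        self.on_disk = {}   # stands in for RRD files on disk
        self.journal = {}   # stands in for pending updates in the journal buffer

    def update(self, name, sample):
        # Updates only touch the journal.
        self.journal.setdefault(name, []).append(sample)

    def flush(self, name):
        # Move any pending updates for one file to "disk".
        self.on_disk.setdefault(name, []).extend(self.journal.pop(name, []))

    def read_direct(self, name):
        # Like reading the RRD file directly: pending updates are not visible.
        return list(self.on_disk.get(name, []))

    def read_fresh(self, name):
        # Like the flush-on-read graph function: flush first, then read.
        self.flush(name)
        return self.read_direct(name)
```

A client that only needs trend data can use the direct path and never forces a flush; a client that needs the latest sample pays for a flush of that one file.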


Figure 8: RRDCache: Performance on the SUT with
161,000 targets processed by 27 MRTG daemons.
The five minute performance goal is easily met
although CPU I/O wait and excessive block reads
vs. writes are evident. [Panels: MRTG, CPU, Disk I/O.]


Application-Offered Advice 


We find sufficient motivation to avoid the operat- 
ing system default readahead and buffer-cache behav- 
iors because of the latencies observed while updating 
RRD files. In the previous section, we’ve shown that 
modifications that drastically reduce RRDTool’s file 
I/O can achieve better performance by working around 
the operating system’s default behavior. 


In this section, we instead improve performance 
by directly influencing the operating system behavior. 
Specifically, we identify a mechanism to cause just the 
desired RRD file blocks to be read and cached. 


⁶ Maintaining the journal buffer as a collection of very
small RRD files would enable it to be read conveniently.


Figure 9: RRD with fadvise: MRTG performance on
the SUT with 162,000 targets processed by 19
MRTG daemons. The five minute performance
goal is clearly met.


Figure 10: RRD with fadvise: CPU utilization on the
SUT with 162,000 targets processed by 19 MRTG
daemons. CPU I/O waits are minimal.


Technique


One technique to suppress readahead is to use
the posix_fadvise system call. This allows applications
to advise the operating system of their future access
patterns of files. The application identifies regions of an
open file (by offset and length) and offers hints as to
whether it will access them sequentially (the default) or
randomly. Additionally, the application can inform the
OS whether or not it expects to access those file regions
again in the near future. Thus, for RRD files, we are able
to advise the OS that the file accesses will be “random.”
This turns off readahead. The result is that only the
“hot” blocks shown in Figure 6 are read and cached. For
Linux 2.6.9 with fadvise RANDOM, on a typical five
minute update only three blocks are cached.
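
As an illustration of the advice described above, the following Python sketch opens a file and marks its whole range as randomly accessed. It is not the RRDTool patch itself; the helper name open_with_random_advice is hypothetical, and the sketch assumes a platform where os.posix_fadvise is available.

```python
import os


def open_with_random_advice(path, flags=os.O_RDWR):
    """Open a file and advise the OS that access will be random (hypothetical helper)."""
    fd = os.open(path, flags)
    try:
        # offset=0, length=0 covers the whole file; on Linux this advice
        # suppresses readahead for the descriptor.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
    except OSError:
        os.close(fd)
        raise
    return fd
```

Subsequent reads and writes through the returned descriptor (for example with os.pread and os.pwrite) then fetch only the blocks actually touched, subject to how the operating system honors the hint.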


Performance Impact


The benefit of disabling readahead is realized
immediately. Figure 9 shows the time elapsed per loop
iteration of each MRTG daemon. With a system of
162,000 RRD files serviced by 19 MRTG daemons,
we see that they finish within 60 seconds including
both the polling and writing phases. This is contrasted
with Figure 3 where we see that most of the updates
do not meet our five minute performance goal. This
also suggests that we can monitor even more targets
within a five minute interval. Note that the increase in
the time to update at zero hours UTC is due to aggregation
(of the AVERAGE and MAX values) that is
synchronized across all RRD files; this is a potential
performance limitation that we also discuss. The
performance potential and limitations are further explored in
the “Scalability” section.


Figure 11: RRD with fadvise: Disk utilization on the
SUT with 162,000 targets processed by 19 MRTG
daemons. I/O Read operations are reduced to
nearly zero due to efficient use of the buffer-cache.


Figure 12: fincore: Comparison of proportion of RRD
files by number of blocks cached without file
access advice (before) and with advice (after).
With fadvise RANDOM, unnecessary RRD file blocks
no longer occupy the buffer-cache.


As expected, the CPU is largely freed from wait- 
ing on I/O to complete (Figure 10). Again comparing 
with Figure 4, we note that the CPU utilization by the 
user level processes and the system remains the same 
as before. The significant difference is the low amount 
of waiting on I/O by the system when the readahead is
suppressed. This implies a lowered number of reads
with the buffer-cache becoming much more effective 
in caching the needed blocks. This is validated by our 
measurement of the reads issued to the disk per second 
by the system (Figure 11). The performance gain is 
significant, reducing approximately 90,000 reads per 
second to about 100 reads per second. 


Figure 12 shows the plot of the proportion of 
RRD files vs. the number of blocks cached for both 
the original system without fadvise and the modified 
one with fadvise. One can see the sharp decrease in the 
number of blocks cached by the modified system with 
fadvise compared to the original system. For the original
system, a sharp inflection point occurs at 27 pages.
This indicates that for a majority of the RRD files 27 pages
were required. Also, the original system required more
pages per file but couldn’t fit them in the buffer-cache. 
For the modified system with fadvise, the inflection 
point occurs at 8, where the system requires 8 pages 
for most of the RRD files. Since the buffer-cache has 
space available for more pages, some of the RRD files 
get to keep more than 8 pages in the buffer-cache. 


Analysis 


To better understand the buffer-cache behavior 
when updating a very large number of RRD files in 
near real-time, we developed both an analytical model 
and a simulation. Analytical modeling improves our 
understanding of RRDTool’s file update behavior so 
we have a solid foundation on which to propose gen- 
eral solutions. The resulting model also provides a 
convenient way to calculate expected page fault rates 
without experimentation and measurement. The simu- 
lation allows us to gather a broader set of results than 
either the model or real-world experiments that would 
prohibitively require repeated reconfiguration of a real 
system’s RRD files or physical memory. 


Analytical Model 


We present an analytical model to predict the 
page fault rate given an RRD file configuration and an 
estimated number of pages available for each RRD
file in the system’s buffer-cache memory. The model- 
ing is done for a system with RRDTool patched to do 
fadvise RANDOM. 


For this analytical model, we first need a list of 
the unique Primary Data Point (PDP) counts in in- 
creasing order, over which the consolidation is per- 
formed for every RRA (for AVERAGE, MAX, and so 
on). Recall that each RRA consolidates some number 
of PDPs that were gathered at the measurement inter- 
val, so an RRA’s PDP count determines how often it is 
updated. For instance, for the RRD file shown in Fig- 
ure 2, the ordered list of values of PDPs over all RRAs 
is: {1, 6, 24, 288}. These represent the periods of con- 
solidation which in the case of Figure 2 refers to 5 
minutes (1), 30 minutes (30/5 = 6), 2 hours (120/5 = 
24) and 1 day (288). We also need a corresponding list
of the number of RRAs that are configured to consolidate
each of those numbers of PDPs. That is, for Figure 2, since
both MAX and AVERAGE RRAs are kept for each of
the PDP values, the associated count list for {1, 6, 24,
288} is {2, 2, 2, 2}.

We denote the ordered PDP list as $\{x_1, x_2, \ldots, x_n\}$
and the associated RRA count list as $\{c_1, c_2, \ldots, c_n\}$.
The cardinality of the ordered list is $n$. Hence in
the example, $\{x_1, x_2, x_3, x_4\} = \{1, 6, 24, 288\}$ and
$\{c_1, c_2, c_3, c_4\} = \{2, 2, 2, 2\}$. Here, $n = 4$.

Now, for one update of an RRA, let B be the 
number of bytes written into the RRD file. In our case 
B = 16 bytes, for the two 8-byte floating-point values. 
For simplicity, we assume that each RRA is block 
aligned. This does not sacrifice generality since the 
average number of page faults due to crossing page 
boundaries is still predicted accurately as long as an 
RRA is at least one block in size, which is typically 
the case. The block size is S bytes. S = 4096 bytes in 
our case. 


The number of updates that fit in a block is u = 
S/B. That is, after every S/B updates a page fault will 
occur. u = 256 in our case. 


Let T be the time after which the primary data
point is updated. T = 5 minutes in our case.
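
The model inputs defined so far can be derived mechanically from an RRA configuration. The following Python sketch is only an illustration: the function name pdp_lists and the (consolidation function, period in minutes) tuple layout are assumptions for this example, while the numeric values are those given in the text for the Figure 2 file.

```python
from collections import Counter


def pdp_lists(rras, base_interval_min):
    """Return the ordered unique PDP counts x and the matching RRA counts c."""
    counts = Counter(period // base_interval_min for _cf, period in rras)
    x = sorted(counts)
    c = [counts[value] for value in x]
    return x, c


# AVERAGE and MAX RRAs at 5 minutes, 30 minutes, 2 hours and 1 day.
RRAS = [(cf, minutes) for minutes in (5, 30, 120, 1440) for cf in ("AVERAGE", "MAX")]

T = 5                      # minutes between primary data point updates
S, B = 4096, 16            # block size and bytes written per RRA update
x, c = pdp_lists(RRAS, T)  # ordered PDP list and RRA count list
u = S // B                 # updates that fit in one block
```

With these values, u updates span u × T = 1280 minutes, so an RRA updated every five minutes moves on to a new block roughly every 21 hours.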


Now we estimate, for a single RRD file, the rate
at which page faults occur given that at most p pages
are available in the buffer-cache for use by this file
(excluding the inode and indirect blocks). When more
than p pages are needed, LRU is used to evict pages to
make room for newer ones.


The time estimated to a page fault, $t$, is the following:

$$t = \operatorname{minimum}(uT,\; x_s T) \qquad (1)$$

where $s$ is such that

$$p - 1 \ge \sum_{i=1}^{s} c_i \qquad (2)$$

$$p - 1 < \sum_{i=1}^{s+1} c_i \qquad (3)$$

minimum() returns the lower of the two values.


The rationale behind $x_s T$: For each RRA, a page
is needed in memory. So we calculate the count of
RRAs which can be fit in $p - 1$ pages. (Only $p - 1$
pages are available to the RRAs because one page is
required for the first block of the RRD file that is
always read as it contains the RRD metadata.) The
subscript $s$ is used to index into the ordered list to
determine the interval after which the fault will occur.
Each index into the ordered list is a discrete point in
time when a fault will occur. For instance, if $s = 2$,
then $x_s$ implies a fault would occur every half hour. (2)
and (3) give the index $s$ based on the count of RRAs
that can fit in $p - 1$ pages.


The rationale behind $uT$: A page fault will surely
occur after $u$ updates.


The number of pages that will see a fault, $m$,
after time $t$:

$$m = \delta \sum_{i=1}^{n} \frac{1}{x_i} \qquad (4)$$

where:

$$\delta = p - \sum_{i=1}^{s} c_i \qquad (5)$$


Rationale behind $\delta$: The number of pages that
need to be brought into memory is the count of the
RRAs at a particular index $s$, which cannot be held by
$p$ pages.


Rationale behind $\sum_{i=1}^{n} \frac{1}{x_i}$: In our case, the update
rate for $x_1$ is six times the update rate of $x_2$. Therefore,
the number of faults for $x_1$ is six times the number
for $x_2$. We sum all the fault rates through $x_n$,
relative to $x_1$, and hence the $\frac{1}{x_i}$ factor.


Rate of page fault, $r$, is given by $m/t$:

$$r = \frac{m}{t} \qquad (6)$$

$$r = \frac{\delta \sum_{i=1}^{n} \frac{1}{x_i}}{\operatorname{minimum}(uT,\; x_s T)} \qquad (7)$$
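
As a quick numeric check of equation (1), the values given in the text (u = 256, T = 5 minutes, and the ordered PDP list {1, 6, 24, 288}) can be substituted directly. The two choices of s below are only illustrative; the first matches the half-hour example in the rationale.

```latex
\begin{align*}
s = 2:&\quad t = \min(256 \cdot 5,\; 6 \cdot 5) = \min(1280,\; 30) = 30 \text{ minutes} \\
s = 4:&\quad t = \min(256 \cdot 5,\; 288 \cdot 5) = \min(1280,\; 1440) = 1280 \text{ minutes}
\end{align*}
```

With these values the uT term only bounds t when the PDP count at index s exceeds u, which here happens only for the daily RRAs.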


This model essentially predicts the average val- 
ues shown with no readahead in Figure 13, as verified 
by simulation. Thus, the model provides a quick way 
to calculate either the expected page fault rate given a 
buffer-cache memory constraint, or vice-versa. In ad- 
dition to this practical result, the analytical model led 
us to the following insights: 

• The total RRD file size is practically irrelevant.
This is ideal since it frees us to extend our data
retention arbitrarily, bounded only by disk space.
For instance, recall that in our system we regularly
resize our RRD files to store five minute
averages for up to five years. (MRTG typically
stores them for less than three days.) The resulting
17x increase in file size only nominally
affects the page fault rates.

• The total number of RRAs with a given aggregation
level is important. For instance, removing
the five minute MAX RRA,⁷ which duplicates
the values in the five minute AVERAGE
RRA, results in a significantly lower page fault
rate when buffer-cache memory is scarce.


Simulation 


In addition to deriving the analytical model, we
developed a page fault simulation. This simulation
provides a means by which to validate the average
page fault rate predicted by the analytical model. It
also exposes the distribution, variance, and peak page
fault rates by time of day.


⁷ We suppose that the MRTG five minute MAX RRA exists
for historical and convenience reasons. While it allows
one to graph just the MAX values and see the entire time
range, users typically put both the AVERAGE and MAX
values on the same graph.


First we simulated an entire lifetime of an RRD 
file’s updates using RRDTool itself, but with synthetic 
input data. Before each update, we used our fadvise 
command’s “don’t need” technique to evict the file’s
cached (hot) pages. After each update we used our fincore
command’s technique to determine the hot pages
and recorded the page numbers to a log.


Secondly, we wrote a buffer-cache simulator with a 
Least-Recently-Used (LRU) page replacement policy 
and replayed the page operation log recorded earlier to 
determine the page faults with varying numbers of 
buffer-cache pages being available per RRD file. 
While our SUT’s buffer-cache is actually managed 
using the Linux 2.6 page replacement algorithm [8], 
similar to 2Q [10], we make the simplifying assump- 
tion that LRU is suitably similar for the purpose of 
simulation. In addition to the RRD file data blocks, we 
also simulated access to the file’s inode and indirect 
blocks. On ext2 and ext3 file systems, a typical MRTG 
RRD file incurs an indirect lookup (and therefore an 
indirect block must occupy space in the buffer-cache) 
for each data block above the twelfth since only the 
first twelve blocks are directly referenced in the inode. 
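
The accounting for inode and indirect blocks described above can be sketched as follows. This is a simplified Python illustration, not the authors' simulator: the function name is hypothetical, and it assumes 4 KB blocks with 4-byte block pointers, so that a single indirect block covers every data block a typical RRD file would use beyond the twelfth.

```python
DIRECT_BLOCKS = 12                 # data blocks referenced directly in the inode
POINTERS_PER_INDIRECT = 4096 // 4  # assumes 4 KB blocks and 4-byte block pointers


def cache_pages_for(data_blocks):
    """Return the set of buffer-cache pages touched when accessing the given blocks."""
    pages = {"inode"}
    for block in data_blocks:  # zero-based data block numbers
        if block >= DIRECT_BLOCKS + POINTERS_PER_INDIRECT:
            raise ValueError("beyond the single-indirect range; not modeled in this sketch")
        pages.add(("data", block))
        if block >= DIRECT_BLOCKS:
            # Blocks above the twelfth are reached through the shared indirect block.
            pages.add("indirect")
    return pages
```

For example, touching data blocks 0 and 20 occupies four pages in this model: the inode, the indirect block, and the two data blocks.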


From the resulting simulated behavior, we can 
determine the expected page faults for a single RRD file 
over time. We then extrapolate by multiplying by the 
target number of RRD files to determine what amount 
of buffer-cache (as limited by physical memory) re- 
duces page fault disk reads to an acceptable level. 
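
A minimal LRU replay of a recorded page log, in the spirit of the simulator described above, might look like the following Python sketch. The function names and the log format (a flat sequence of page identifiers for one file) are assumptions for illustration; the extrapolation simply multiplies the per-file result by the number of files, as the text describes.

```python
from collections import OrderedDict


def count_page_faults(page_log, cache_pages):
    """Replay page accesses through an LRU cache and count the misses."""
    cache = OrderedDict()
    faults = 0
    for page in page_log:
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            faults += 1                    # page must be read in
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict the least recently used page
    return faults


def extrapolate(faults_per_file, n_files):
    """Scale a single file's fault count to a population of similar files."""
    return faults_per_file * n_files
```

Running the same log with different values of cache_pages gives the kind of faults-versus-pages curve that the simulation results report.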


Figure 13: Simulation with and without fadvise:
Page-ins as a function of the number of buffer-cache
pages available per MRTG RRD file. Significantly
reduced paging rates result following
the dramatic drops. [Y-axis: Page Operations per
Update Interval (300 seconds).]


We run the simulation for both the original RRDTool
(default readahead) and the one patched with fadvise
RANDOM (no readahead). Figure 13 shows the
average number of page-in operations for both versions
of RRDTool as a function of the number of
buffer-cache pages available per MRTG RRD file. For
the original RRDTool, when the number of pages
available in the buffer-cache is less than 18, the num- 
ber of page faults is very high. It falls at 18 because 
the 16 pages required for initial readahead are avail- 
able in the buffer-cache at this point. (It is 16+2 
because two extra blocks are required for the file 
inode and indirect blocks.) 


Also, sometimes, the AVERAGE and MAX RRAs 
get written within these 16 pages. For the patched ver- 
sion with fadvise RANDOM, the average number of page- 
ins is close to zero if more than 7 pages are available in 
the buffer-cache. Page faults still occur when 8 pages are 
available in the buffer-cache but the average page-in rate 
is extremely low as shown in Figure 14. Note that more 
pages are written out at the aggregation intervals of 30 
minutes, 2 hours and one day. The time of day 00:00 
UTC shows the peak paging activity, when aggregation 
happens for daily RRAs. 


Our simulation results are validated by the earlier 
observations of the real system’s performance with 
fadvise RANDOM. Specifically, the read and write pat- 
tern of the simulation in Figure 14 agrees with the 
observations of the real system with fadvise RANDOM 
in Figure 11, i.e., near zero read (page-in fault) rate. 
Earlier, using fincore in the real system, we also 
observed that most RRD files had only 8 pages in 
cache (Figure 12); this is the value that simulation 
shows (Figure 13) is the minimum required to achieve 
an average page fault rate of nearly zero.


Figure 14: Simulation with fadvise: Average page-out
and page-in operations by time of day given
eight buffer-cache pages available per MRTG
RRD file. As few as eight pages per file reduces
page-ins to near zero. [Axis label: 4KB Pages.]


Scalability 


We have shown that using either application- 
level buffering or application-offered advice dramati- 
cally improves the performance of RRD systems. In 
this section we show the performance and capacity 
scalability characteristics of a very large RRD and
MRTG-based network measurement system by testing
it first with just application advice, and secondly with 
advice plus application-level buffering. Thus, we explore 
the performance of the simpler of the two techniques and 
also the two techniques combined to determine the 
upper-bound to the scalability of such a system. 


Our production MRTG system today monitors 
approximately 3,000 network devices with approxi- 
mately 160,000 MRTG targets. Recall that each target 
is typically a pair of measurements such as byte, 
packet and error rates, inbound and outbound. Thus, 
the production system measures and records approxi- 
mately 320,000 data points every five minutes. Having 
already found that either improvement technique re- 
sults in satisfactory performance and thus does not 
push our system to its limit, we now construct an even 
larger system to study scalability. 


We create an MRTG system three times the size by 
replicating our existing production measurement system 
so that there are three like-configured MRTG instances 
all running on one server. Our experimental procedure is 
to begin with approximately 160,000 targets (i.e., the production
MRTG system) and then progressively add 20,000 targets
every twenty minutes (10,000 from each of the
replicated instances), until all three systems are running 
in parallel for a total exceeding 480,000 targets. 


MRTG with fadvise RANDOM 


We first tested the scalability of MRTG with RRD- 
Tool patched to do fadvise RANDOM. The performance 
results are shown in Figures 15 and 16. This is the sys- 
tem we claim as the world’s largest MRTG, operating 
with acceptable performance at around 320,000 RRD 
files. While there are some outlier points in the upper 
left of Figure 15, they occur at twenty minute intervals 
and there are exactly two per interval. Thus, these out- 
liers are an artifact of the experimental procedure 
showing latency during just the very first loop iteration
of each of the two new MRTG daemons as
the set of hot pages for their RRD files is read into
buffer-cache. Beyond about 320,000 targets in the
scalability test, performance is unacceptable because 
page faults increased and CPU utilization continually 
exceeded 65%, leaving little room for other tasks. 


In Figure 16 note the spikes in CPU I/O wait 
state just following 1600 and 1800 hours. These are 
due to aggregations that occur every two hours in typi- 
cal MRTG RRD files. Furthermore, note that as the 
number of targets increases, similar spikes are seen at 
half hour intervals following 1800 hours. These spikes 
indicate that the number of hot pages exceeds the 
capacity of the buffer-cache on the SUT, resulting in 
an excessive page fault rate. We estimate our buffer- 
cache requirement to be 8 pages per file and 480,000 x 
8 x 4 KB = 14.6 GB. Although the SUT has 16 GB of 
memory in total, often only 10 GB is available for 
buffer-cache. Ultimately, the high CPU utilization in- 
terfered with SNMP polling (as evidenced by a drop in 
network traffic) so the test was stopped.


Figure 15: Scaling with fadvise: MRTG performance
on the SUT progressively increasing to 483,000
targets and 53 MRTG daemons.


Figure 16: Scaling with fadvise: CPU utilization on the
SUT while progressively increasing to 483,000 targets
and 53 MRTG daemons.
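
The buffer-cache estimate in the fadvise RANDOM scalability discussion (8 pages per file, 4 KB per page) can be reproduced with a short Python calculation. The helper names are hypothetical, and the sketch assumes that "GB" in the text means 2^30 bytes, which is the interpretation that yields the quoted 14.6 GB.

```python
PAGE_BYTES = 4 * 1024   # 4 KB pages
PAGES_PER_FILE = 8      # hot pages per RRD file, per the estimate in the text
GIB = 1024 ** 3         # assumption: "GB" in the text means 2**30 bytes


def cache_needed_gib(n_files, pages_per_file=PAGES_PER_FILE):
    """Buffer-cache needed to keep every file's hot pages resident."""
    return n_files * pages_per_file * PAGE_BYTES / GIB


def files_that_fit(cache_gib, pages_per_file=PAGES_PER_FILE):
    """Number of files whose hot pages fit in a cache of the given size."""
    return int(cache_gib * GIB // (pages_per_file * PAGE_BYTES))


print(round(cache_needed_gib(480_000), 1))  # requirement at 480,000 files
print(files_that_fit(10))                   # files that fit in a 10 GB buffer-cache
```

Under the same assumptions, a 10 GB buffer-cache holds the hot pages of roughly 327,000 files, which is in line with the point (about 320,000 targets) beyond which the text reports unacceptable performance.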


MRTG with fadvise RANDOM and RRDCache 


Subsequently, we tested the scalability of MRTG 
using RRDCache combined with RRDTool patched to 
fadvise RANDOM. The performance results are shown 
in Figures 17 and 18. These combined techniques 
yielded the highest capacity, exceeding 400,000, but 
CPU utilization reached 100% and the RRDCache- 
writer could not complete its hourly updates within an 
hour, so an increasing backlog developed from which 
it didn’t recover and the test was stopped. 

Limitations 

These scalability tests on our SUT show at least 
two capacity or performance limitations of large RRD 
and MRTG systems: 

1. When buffer-cache is scarce, page fault
rates peak at RRD aggregation times that are
predictable offsets from zero hours UTC.
Synchronized aggregations, and thus consolidated
data points with matching timestamps,
are a convenience to RRDTool users when
fetching or graphing the data. However, when
updating RRD files in near real-time, there is a
clear performance consequence to synchronized
aggregation because work is not distributed
evenly across time.

2. CPU utilization approached or reached 100%
when updating around 480,000 RRD files. This
consists primarily of user mode CPU that we
attribute to the MRTG and RRDCachewriter
perl scripts. Thus, the next performance bottleneck
limiting the scalability of MRTG systems
is likely CPU.


[Figures 17 and 18, surviving caption fragment: “…ing to
486,000 targets and 52 MRTG daemons.”]


We’ve shown that MRTG and RRD systems can 
scale to hundreds of thousands of targets or RRD files.
Without changing the RRD file read semantics, our
fadvise RANDOM method allows us to scale to 320,000
targets and files with acceptable performance on our
system with 16 GB of memory. With slightly changed
read semantics (because of the deferred RRD file
updates), the RRDCache method scales higher. The
factors that limit further scalability are (i) the CPU
required for target processing (in perl) and (ii) RRDTool’s
aggregations at synchronized times across all
RRD files. Further gains can be achieved by profiling
and optimizing the system software (e.g., MRTG) and
by appropriately sizing the system’s physical memory
so that an even larger buffer-cache is available.


Related Work 


Our work is informed by prior operating system 
and application performance improvement techniques. 


Within the operating system, better buffer-cache 
management techniques can help reduce the number 
of disk reads and writes. There are a number of poli- 
cies described in the literature (e.g., FIFO, LRU, LFU, 
Clock, Random, Segmented FIFO, 2Q [10], and LRU- 
K). The readahead by the OS can limit the amount of 
useful data that can be cached. In some circumstances, 
improvements to the adaptive readahead algorithm can 
significantly improve performance [14]. 


The ability to accept hints or advice from appli- 
cations with the aim of more efficiently managing 
resources and improving performance was imple- 
mented in the Pilot operating system [20]. An inter- 
face by which operating systems can accept such 
advice specifically to provide buffer management has 
long-since been suggested [25]. Later work finds that 
application “hints” to inform the operating system of 
file access patterns improves performance [15]. Today, 
some operating systems have support for application 
advice via the fadvise and madvise APIs [19]. 


A related approach is to change the kernel to
include functionality which enables application guided 
buffer-cache control [2]. Another possibility is to sim- 
ulate the cache replacement algorithm to build a rea- 
sonably accurate model of the contents of the cache 
for reordering reads and writes [4]. 


Within our version of the operating system (Linux 
2.6), there are four I/O scheduling algorithms avail- 
able: Completely Fair Queuing (CFQ) scheduling, 
deadline elevator, NOOP scheduler and Anticipatory 
elevator scheduling [21]. The scheduling algorithm 
can prioritize individual I/O requests, such as reads 
over writes, and therefore affects application perfor- 
mance when page faults occur. 


A number of software systems inspired by the 
original MRTG have improved its performance in 
some ways. The current MRTG, Cricket [1], Cacti [5], 
and Torrus [23] applications use RRDTool [13] to 
achieve improved performance. Cricket [1] allows 
configuration of more parallel measurements per file,
but this offers only a modest performance improvement
since tens to hundreds of thousands of RRD files
would still be required. RTG [3] made significant 
changes in the polling (but that is not our bottleneck) 
and replaced the file I/O with relational database I/O. 
We have no reason to believe this would offer better 
I/O performance, and it significantly changes the user 
interface to the data. JRobin [11] completely reimplements
RRDTool in Java, improving performance in
some areas but decreasing it in others and modifying 
the RRD file format in the process. 


Recently, RRD users have proposed design 
changes or made customizations to introduce an 
application-level cache maintained by a daemon that 
intercepts updates [22, 12, 9]. 


Discussion and Future Work 


Our investigation and experimentation thus far 
suggests at least the following potential items of future 
work. 

¢ File Types: UNIX-based operating systems lack 
file types; a file is simply a stream of bytes and 
this is often cited as an advantage or, at least, a 
successful simplification. However, this is one 
reason that the RRD file update access pattern is 
not handled well by the adaptive readahead 
algorithm. Perhaps the readahead and other 
behaviors, such as caching, could be influenced 
or determined at file open time based on a file’s 
type, as defined by file name extension (e.g., 
““rrd”) or by magic, the file command’s magic 
number file. 
File Read Performance: Although we have 
disabled readahead to achieve better update 
performance, we have not thoroughly investi- 
gated its effect on RRD fetch or graphing (read) 
performance. We surmise that advising for ran- 
dom I/O helps read performance too, but have 
not carefully measured it. 
Also, we selected the Linux deadline I/O sched- 
uler because it prioritizes reads over writes, but 
evaluating this decision is left for future work. 
Linux’ Completely Fair Queuing (CFQ) sched- 
uler may perform acceptably as well; we have 
not compared them. 

• RRD Update Interval: Some network operators
desire more frequent measurements, such
as a one-minute interval rather than five. Future
work might explore whether RRD scales similarly in
this situation. We believe our page fault rate
model is valid for all update intervals somewhat
greater than that of the system’s page
replacement algorithm. (Note that dirty pages
are flushed at five-second intervals by pdflush in
Linux 2.6.)
Our performance results suggest that the CPU
load would be a limiting factor as the update
interval decreases, i.e., if updates are more frequent.
Perhaps simply choosing a one-minute
interval would constrain the capacity to about one
fifth of that achieved with a five-minute interval.
Judicious partitioning of the set of targets
would help, e.g., using a one-minute interval for
measuring the core and/or distribution links and
a five-minute interval for the more numerous
access ports.

• MRTG CPU Utilization: As is to be expected
in system performance work, we found that
eliminating the I/O bottleneck exposed the next
bottleneck, CPU, that limits scalability. It seems
this high CPU utilization is primarily due to the
MRTG Perl script, so profiling and optimizing
it could improve performance.

• “Gaming” the Readahead Algorithm: While
our platform provides the BSD and POSIX fadvise
APIs, others do not or do not yet completely
implement them. Can we instead “game” their
readahead algorithms, for instance by performing
otherwise unnecessary no-op seek operations,
to likewise disable readahead? What is the
performance cost of doing so? If adaptive readahead
algorithms are similar, this might also have
portability benefits. (We’ve observed that some
operating systems use a readahead of zero initially
[footnote 8], so they exhibit the desired behavior for
RRD files without needing advice or adaptation.)

Alternatively, the adaptive readahead algorithm
could be improved to respect a file’s access history,
i.e., by initially setting readahead based on a
series of previous file access (open/close) “sessions”
by the current process or prior processes.

• RRD File Design: We’ve seen how the RRD
file organization influences its update performance.
Is there a better organization for RRD
files? For example, locality of data for updates
would improve if RRAs with the same number
of PDPs (but different consolidation functions)
could be interleaved in the same block so that
their corresponding data points are nearby when
updated.

Is there a way to avoid synchronized aggregations
or consolidations across all RRD files?
Perhaps we can introduce a stochastic component
to skew those updates slightly in time.
This is difficult to do without affecting RRD
file read semantics and without introducing an
independent thread to perform updates.
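
The File Types item above suggests choosing readahead behavior at open time from a file’s type. The following is a minimal user-space sketch of that idea, not the patch described in this paper: it assumes Python on a platform where os.posix_fadvise is available, and the extension table and both function names are hypothetical.

```python
import os

# Hypothetical policy table: extensions whose files are accessed randomly.
RANDOM_ACCESS_EXTENSIONS = {".rrd"}


def wants_random_advice(path):
    """Return True if the file name extension marks the file as randomly accessed."""
    return os.path.splitext(path)[1].lower() in RANDOM_ACCESS_EXTENSIONS


def open_with_advice(path, flags=os.O_RDWR):
    """Open path and, where supported, advise the kernel of random access.

    Returns (fd, advised); advised is False when no advice was issued.
    """
    fd = os.open(path, flags)
    advised = False
    if wants_random_advice(path) and hasattr(os, "posix_fadvise"):
        # offset=0 and length=0 apply the advice to the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
        advised = True
    return fd, advised
```

A kernel-level version of this policy would make the same decision inside open(); the sketch only shows that the decision itself needs nothing more than the file name.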


Conclusions 


In conclusion, we’ve provided a general analysis
method and two new tools, fincore (available at [18])
and fadvise (available at [16]), that expose readahead
and buffer-cache behaviors in running systems. Without
such tools, these performance-critical aspects of
the operating system are hidden from system administrators
and users.


8. Apple’s OS X with the HFS+ file system has an initial
readahead of zero.


Plonka, Gupta, & Carder


By both modeling and simulation, we’ve pro- 
vided a detailed analysis of the I/O characteristics of 
RRD file updates. We’ve shown how the locality of 
RRD file accesses can be leveraged, limiting page 
faults and disk I/O, resulting in improved performance 
and scalability for RRD systems. We’ve found that 
RRD buffer-cache utilization and page faults are 
defined by subtleties in the RRD file format and RRD- 
Tool’s access pattern, rather than simply being defined 
by file size. This is advantageous because it means
that larger RRD systems can be operated than would
otherwise be thought possible.


We’ve outlined two effective methods to improve 
RRD performance. The first, RRDCache (available at 
[6]), is what we’ve called application-level caching or 
buffering. The second, for which we provide a patch to 
RRDTool (available at [17]), issues application advice 
to the operating system to select readahead and buffer- 
cache behavior appropriate for random RRD file I/O. 
While the two methods are starkly different, both 
eliminate the buffer-cache memory bottleneck that has 
been observed in large RRD network measurement sys- 
tems. Conservatively, either technique triples the capac- 
ity of such systems. Together, these complementary 
techniques can be applied to maximize performance. 


Finally, we’ve shown that system tuning and mi- 
nor capacity-enhancing code changes improve Round 
Robin Database performance so that RRDTool can be 
used for even the largest managed networks. 


Appendix: Performance Recommendations for
RRD and MRTG Systems


• When building a very large RRD measurement
system, dedicate the machine to this purpose.
Since RRD is a file-based database, it relies on
the buffer-cache that is shared across all system
activity. Because of RRD’s unique file access
characteristics and buffering requirements, it is
easier to achieve performance gains by tuning
the system just for RRD.

• Use an RRDTool that has our fadvise RANDOM
patch. On systems that have a fairly aggressive
initial readahead (such as Linux), this will very
likely increase file update performance by reducing
the page fault rate and the buffer-cache
memory required.

• Avoid file-level backups of RRD files unless the
set of RRD files completely fits into buffer-cache
memory. File-level backups read each modified
file completely and sequentially; this can fill the
buffer-cache and subsequently cause more page
faults on RRD updates. To the operating system,
backups are essentially indistinguishable from
application access, and thus unnecessarily populate
the system’s buffer-cache with content that won’t
be re-used soon.
(Note that backup programs could call fadvise
NOREUSE or fadvise DONTNEED to inform the
operating system that the file content will not
be re-used.)

• Split MRTG targets into a number of groups
and run a separate daemon for each. In our system,
we reconfigure daily and run a target_splitter
script to produce a new set of “.cfg” files,
each with approximately 10,000 targets per
MRTG daemon. Note that polling performance
is also influenced by the SNMP agent performance
on the network device polled. So, if the
splitting results in grouping like targets together
based on the model of device monitored, there
could be quite a disparity in time to complete
the MRTG “poll targets” phase.

• Do not create RRD files all at once. By staggering
the start times, updates to like RRAs will
cross block boundaries at different times, distributing
the page faults that occur on block
boundary crossings. As a network is deployed
and grows, these RRD file start times would
naturally be staggered, but this could be quite
different when introducing measurement to an
existing deployed network.

• Run a caching resolver or a nameserver on the
localhost, i.e., the MRTG system itself. This
reduces “poll targets” latency due to host name
resolution; MRTG performs very many DNS
name resolutions when hostnames are used
(rather than IP addresses) in target definitions.

• Configure an appropriate number of forks for
each MRTG daemon to minimize the time for
the “poll targets” phase. On our system, 4
forks per daemon works well to keep polling in
the tens of seconds for 10,000 targets. This
might differ for a wide-area network.

• Place RRD files in a file-system of their own,
ideally one associated with separate logical volumes
or disks. This gives the system administrator
flexibility to change mount options or
other file-system options. It also isolates the
system activity data (e.g., as displayed by sar)
from unrelated activity.

• Consider mounting the file-system that contains
the RRD files with the “noatime” and “nodiratime”
options so that RRD file reads do not
require an update to the file inode block. Of
course the effect of this is that file access times
will be inaccurate, but often these are not of
interest for “.rrd” files.

• Consider enabling dir_index on ext file-systems
to speed up lookups in large directories. MRTG
places all RRD files in the same directory, and
we’ve scaled to hundreds of thousands of files.
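
The target-splitting recommendation above can be illustrated with a short sketch. This is not the target_splitter script mentioned in the list, which is not shown in this paper; the function name and the round-robin dealing are assumptions. Dealing targets round-robin is one way to avoid placing like targets (which tend to be adjacent in a configuration) in the same group.

```python
import math


def split_targets(targets, max_per_daemon=10000):
    """Split targets into the fewest groups of at most max_per_daemon each.

    Targets are dealt round-robin, so neighbors in the input list land in
    different groups and group sizes differ by at most one.
    """
    if max_per_daemon < 1:
        raise ValueError("max_per_daemon must be positive")
    if not targets:
        return []
    n_groups = math.ceil(len(targets) / max_per_daemon)
    return [targets[i::n_groups] for i in range(n_groups)]
```

Each resulting group would then be written to its own configuration file and polled by its own MRTG daemon.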


ABSTRACT


In virtual machine environments each application is often run in its own virtual machine 
(VM), isolating it from other applications running on the same physical machine. Contention for 
memory, disk space, and network bandwidth among virtual machines, coupled with an inability to 
share due to the isolation virtual machines provide, leads to heavy resource utilization. 
Additionally, VMs increase management overhead as each is essentially a separate system. 


Stork is a package management tool for virtual machine environments that is designed to 
alleviate these problems. Stork securely and efficiently downloads packages to physical machines 
and shares packages between VMs. Disk space and memory requirements are reduced because 
shared files, such as libraries and binaries, require only one persistent copy per physical machine. 
Experiments show that Stork reduces the disk space required to install additional copies of a 
package by over an order of magnitude, and memory by about 50%. Stork downloads each 
package once per physical machine no matter how many VMs install it. The transfer protocols 
used during download improve elapsed time by 7X and reduce repository traffic by an order of 
magnitude. Stork users can manage groups of VMs with the ease of managing a single machine,
even groups that consist of machines distributed around the world. Stork is a real service that has
run on PlanetLab for over four years and has managed thousands of VMs.


Introduction 


The growing popularity of virtual machine (VM) 
environments such as Xen [3], VMWare [31], and 
Vservers [17, 18], has placed new demands on pack- 
age management systems (e.g., apt [2], yum [36], 
RPM [27]). Traditionally, package management sys- 
tems deal with installing and maintaining software on 
a single machine whether virtual or physical. There 
are no provisions for inter-VM sharing, so that multi- 
ple VMs on the same physical machine individually 
download and maintain separate copies of the same 
package. There are also no provisions for inter- 
machine package management, centralized administra- 
tion of which packages should be installed on which 
machines, or allowing multiple machines to download 
the same package efficiently. Finally, current package 
management systems have relatively inflexible secu- 
rity mechanisms that are either based on implicit trust 
of the repository, or public/private key signatures on 
individual packages. 


Stork is a package management system designed 
for distributed VM environments. Stork has several 
advantages over existing package management sys- 
tems: it provides secure and efficient inter-VM pack- 
age sharing on the same physical machine; it provides 
centralized package management that allows users to 
determine which packages should be installed on 
which VMs without configuring each VM individu- 
ally; it allows multiple physical machines to download 
the same package efficiently; it ensures that package
updates are propagated to the VMs in a timely fashion;
and it provides a flexible security mechanism that 
allows users to specify which packages they trust as 
well as delegate that decision on a per-package basis 
to other (trusted) users. 


Stork’s inter-VM sharing facility is important for 
reducing resource consumption caused by package 
management in VM environments. VMs are excellent 
for isolation, but this very isolation can increase the 
disk, memory, and network bandwidth requirements of 
package management. It is very inefficient to have each 
VM install its own copy of each package’s files. The 
same is true of memory: if each VM has its own copy 
of a package’s files then it will have its own copy of the 
executable files in memory. Memory is often more of a 
limiting factor than disk, so Stork’s ability to share 
package files between VMs is particularly important for 
increasing the number of VMs a single physical ma- 
chine can support. In addition, Stork reduces network 
traffic by only downloading a package to a physical 
machine once, even if multiple VMs on the physical 
machine install it. 


Stork’s inter-machine package management fa- 
cility enables centralized package management and effi- 
cient, reliable, and timely package downloads. Stork 
provides package management utilities and configura- 
tion files that allow the user to specify which packages 
are to be installed on which VMs. Machines download 
packages using efficient transfer mechanisms such as 
BitTorrent [9] and CoBlitz [22], making downloads 
efficient and reducing the load on the repository. Stork
uses fail-over mechanisms to improve the reliability of 
downloads, even if the underlying content distribution 
systems fail. Stork also makes use of publish/subscribe 
technology to ensure that VMs are notified of package 
updates in a timely fashion. 


Stork provides all of these performance benefits 
without compromising security; in fact, Stork has addi- 
tional security benefits over existing package manage- 
ment systems. First, Stork shares files securely between 
VMs. Although a VM can delete its link to a file, it 
cannot modify the file itself. Second, a user can se- 
curely specify which packages he or she trusts and may 
delegate this decision for a subset of packages to 
another user. Users may also trust other users to know 
which packages not to install, such as those with secu- 
rity holes. Each VM makes package installation deci- 
sions based on a user’s trust assumptions and will not 
install packages that are not trusted. While this paper 
touches on the security aspects of the system that are 
necessary to understand the design, a more rigorous 
and detailed analysis of security is available through 
documentation on our website [29]. 


In addition, Stork is flexible and modular, allowing 
the same Stork code base to run on a desktop PC, a 
Vserver-based virtual environment, and a PlanetLab 
node. This is achieved via pluggable modules that isolate 
the platform-specific functionality. Stork accesses these 
modules through a well-defined API. This approach 
makes it easy to port Stork to different environments and 
allows the flexibility of different implementations for 
common operations such as file retrieval. 


Stork has managed many thousands of VMs and 
has been deployed on PlanetLab [23, 24] for over four 
years. Stork is currently running on hundreds of Planet- 
Lab nodes and its package repository receives a request 
roughly every ten seconds. Packages installed in multiple 
VMs by Stork typically use over an order of magnitude
less space and 50% of the memory of packages installed by
other tools. Stork also reduces the repository load by 
over an order of magnitude compared to HTTP-based 
tools. Stork is also used in the Vserver [18] environment 
and can also be used in non-VM environments (such as 
on a home system) as an efficient and secure package 
installation system. The source code for Stork is avail- 
able at http://www.cs.arizona.edu/stork . 


Stork 


Stork provides manual management of packages 
on individual VMs using command-line tools that have 
a syntax similar to apt [2] or yum [36]. Stork also pro- 
vides centralized management of groups of VMs. This 
section describes an example involving package man- 
agement, the configuration files needed to manage VMs 
with Stork, and the primary components of Stork. 


An Example 


Consider a system administrator who manages
thousands of machines at several sites around the
globe. The company’s servers run VM software that
allows different production groups more flexible use of
the hardware resources. In addition, the company’s
employees have desktop machines that have different
software installed depending on their use.


The system administrator has just finished testing 
a new security release for a fictional package foobar and 
she decides to have all of the desktop machines used for 
development update to the latest version along with any 
testing VMs that are used by the coding group. The ad- 
ministrator modifies a few files on her local machine, 
signs them using her private key, and uploads them to a 
repository. Within minutes all of the desired machines 
that are online have the updated foobar package installed.
As offline machines come online or new VMs
are created, they automatically update their copies of
foobar as instructed.


Columns: File Type | Repository | Client | Central Mgmt | Signed and Embedded
Rows (file types): User Private Key; User Public Key; Master Configuration File;
Trusted Packages (TP); Pacman Packages; Pacman Groups; Packages (RPM, tar.gz);
Package Metadata; Repository Metahash
[The cell values of this table did not survive extraction. Only the entries
“Secure Hash”, “No”, and “Signed Only” remain, without their positions.]


Table 1: Stork File Types: This table shows the different types of files used by Stork. The repository column indicates
whether or not the file is obtained from the repository by the clients. The client column indicates whether
or not the file is used for installing packages or determining which packages should be installed locally based
upon the files provided by the centralized management system. The centralized management column indicates if
the files are created by the management tools. The signed/embed column indicates which files are signed and
have a public key embedded in their name.


† In order to automatically deploy Stork on PlanetLab, this restriction is relaxed. See the PlanetLab section for
more details.


The subsequent sections describe the mecha- 
nisms Stork uses to provide this functionality to its 
users. The walkthrough section revisits this example 
and explains in detail how Stork provides the func- 
tionality described in this scenario. 


File Types 

Stork uses several types of files that contain dif- 
ferent information and are protected in different ways 
(Table 1). The user creates a public/private key pair 
that authenticates the user to the VMs he or she con- 
trols. The public key is distributed to all of the VMs 
and the private key is used to sign the configuration 
files. In our previous example, the administrator’s 
public key is distributed to all of the VMs under her 
control. When files signed by her private key were 
added to the repository, the authenticity of these files 
was independently verified by each VM using the 
public key. 

The master configuration file is similar to those 
found in other package management tools and indi- 
cates things such as the transfer method, repository 
name, user name, etc. It also indicates the location of 
the public key that should be used to verify signatures. 


The user’s trusted packages file (TP file) indicates
which packages the user considers valid. The TP
file does not cause those packages to be installed, but 
instead indicates trust that the packages have valid 
contents and are candidates for installation. For exam- 
ple, while the administrator was testing the latest 
release of foobar she could add it to her trusted pack- 
ages file because she believes the file is valid. 


There are two pacman files used for centralized 
management. The groups.pacman file allows VMs to 
be categorized into convenient groups. For example, 
the administrator could configure her pacman groups 
file to create separate groups for VMs that perform 
different tasks. A VM can belong to multiple groups,
such as ALPHA and ACCOUNTING for an alpha test
version of accounting software. Any package management
instructions for either the ALPHA group or the
ACCOUNTING group would be followed by this
VM.


The packages.pacman file specifies what actions 
should be done on a VM or a group of VMs. Packages 
can be installed, updated, or removed. Installation is 
different from updating in that installation will do 
nothing if there is a package that meets the criteria 
already installed while update ensures that the pre- 
ferred version of the package is installed. For exam- 
ple, when asked to install foobar, if any version of the 
package is currently installed then no operation will 
occur. If asked to update foobar, Stork checks to see if 
the administrator’s TP file specifies a different version 
of foobar and if so, replaces the current version with 
the new version. 
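

The difference between install and update described above can be summarized in a few lines. This is a sketch of the stated semantics only, not Stork’s code: the function name, the action strings, and the returned tuples are hypothetical, and the “preferred” version stands for whatever version the administrator’s TP file selects.

```python
def plan_action(action, installed, preferred):
    """Return (operation, version) for one package on one VM.

    action    -- "install", "update", or "remove"
    installed -- version currently installed, or None if absent
    preferred -- version preferred by the TP file
    """
    if action == "install":
        # Install does nothing when any version is already present.
        if installed is not None:
            return ("noop", installed)
        return ("install", preferred)
    if action == "update":
        # Update ensures that the preferred version is the one installed.
        if installed == preferred:
            return ("noop", installed)
        if installed is None:
            return ("install", preferred)
        return ("replace", preferred)
    if action == "remove":
        if installed is None:
            return ("noop", None)
        return ("remove", installed)
    raise ValueError("unknown action: %r" % (action,))
```

For example, with foobar 1.0 installed and 1.01 preferred, an install request is a no-op while an update request replaces 1.0 with 1.01.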


Stork: Package Management for Distributed VM Environments 


The packages (for example, the foobar RPM 
itself) contain the software that is of interest to the 
user. The package metadata is extracted from pack- 
ages and is published by the repository to describe the 
packages that are available. The repository metahash 
is a special file that is provided by the repository to 
indicate the current repository state. 


Architecture 


Stork consists of four main components:

• a repository that stores configuration files, packages,
and associated metadata;

• a set of client tools that are used in each Stork
client VM to manage its packages by interacting
either directly with the repository or through the
nest when it is available;

• a nest process that runs on physical machines
and coordinates sharing between VMs, provides
repository metadata updates to its
client VMs, and downloads packages;

• and centralized management tools that allow a
user to control many VMs concurrently, create
and sign packages, upload packages to the repository,
etc.






Figure 1: Stork Overview. Stork allows centralized
administration and sharing of packages. The administrator
publishes packages and metadata on
the repository. Updates are propagated to VMs
running on distributed physical machines. Each
physical machine contains a single nest VM, and
one or more client VMs that run the Stork client
tools.


The client tools consist of the stork command-line 
tool (referred to simply as stork), which allows users to 
install packages manually, and pacman, which supports 
centralized administration and automated package in- 
stallation and upgrade. While a client VM may com- 
municate with the repository directly, it is far more 
efficient for client VMs to interact with their local nest
process, which interacts with the repository on their
behalf.


Repository

The Stork repository’s main task is to serve files 
much like a normal web server. However, the repository 
is optimized to efficiently provide packages to Stork 
client VMs. First, the repository provides secure user 
upload of packages, trusted packages files, and pacman 
packages and groups files. Second, the repository pushes 
notifications of new content to interested VMs. Third, 
the repository makes packages available via different 
efficient transfer mechanisms such as BitTorrent. 


Handling Uploaded Data. The Stork repository
allows multiple users to upload files while retaining 
security. TP, groups.pacman, and packages.pacman files 
must be signed by the user that uploads them. Every 
signed file has a timestamp for the signature embed- 
ded in the portion of the file protected by the signa- 
ture. The public key of the user is embedded in the 
file name of the signed file (similar to self-certifying 
path names [19]). This avoids naming conflicts and 
allows the repository to verify the signature of an 
uploaded file. The repository will only store a signed 
file with a valid signature that is newer than any exist- 
ing signed file of the same name. This prevents replay 
attacks and allows clients to request files that match a 
public key directly. 
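

The acceptance rule for signed uploads can be sketched as follows. This is an illustration under stated assumptions rather than the repository’s implementation: the class and method names are hypothetical, and signature verification is reduced to a boolean that the caller is assumed to have computed against the public key embedded in the file name.

```python
class SignedFileStore:
    """Keeps, per file name, only the newest validly signed upload."""

    def __init__(self):
        self._newest = {}  # file name -> signature timestamp of stored copy

    def accept(self, name, timestamp, signature_ok):
        """Return True if the upload is stored, False if it is rejected."""
        if not signature_ok:
            return False  # invalid signature: never stored
        stored = self._newest.get(name)
        if stored is not None and timestamp <= stored:
            return False  # replayed or stale copy of an existing file
        self._newest[name] = timestamp
        return True
```

Because the timestamp lies inside the signed portion of the file, an attacker who replays an old, validly signed file cannot make it appear newer than the copy already stored.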


Packages and package metadata are treated differ- 
ently than configuration files. These files are not signed, 
but instead incorporate a secure hash of their contents in 
their names. This prevents name collisions and allows 
clients to request packages directly by secure hash. In all 
cases, the integrity of a file is verified by the recipient 
before it is used (either by checking the signature or the 
secure hash, as appropriate). The repository only per- 
forms these checks itself to prevent pollution of the 
repository and unnecessary downloads, rather than to 
ensure security on the clients. 
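

Naming a file by a secure hash of its contents lets any recipient check integrity without a signature. The sketch below assumes SHA-1 and a hypothetical “<hash>-<original name>” layout; the actual naming scheme used by the repository is not specified in this section.

```python
import hashlib


def hashed_name(filename, data):
    """Return a name that embeds the SHA-1 of data (hypothetical layout)."""
    return "%s-%s" % (hashlib.sha1(data).hexdigest(), filename)


def verify_hashed_name(name, data):
    """Return True if data matches the hash embedded at the start of name."""
    digest, sep, _rest = name.partition("-")
    return bool(sep) and hashlib.sha1(data).hexdigest() == digest
```

Two different files can never collide under such a name unless their hashes collide, and a client that requests a package by hash can reject any download whose contents do not match.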


Pushing Notifications. The repository notifies
interested clients when the repository contents have 
changed. The repository provides this functionality by 
pushing an updated repository metahash whenever data 
has been added to the repository. However, this does not 
address the important question of what data has been 
updated. This is especially difficult to address when 
VMs may miss messages or suffer other failures. 


One solution is for the repository to push out 
hashes of all files on the repository. As there are many 
thousands of metadata files on the repository, it is too 
costly to publish the individual hashes of all of them 
and have the client VMs download each metadata file 
separately. Instead, the repository groups metadata 
files together into tarballs organized by type. For example,
one tarball contains all of the trusted packages
files, another contains all of the pacman files, etc. The
hashes of these tarballs are put into the repository
metahash, which is pushed to each interested client
VM. No matter how many updates the client VM 
misses, it can examine the hash of the local tarballs 
and the hashes provided by the repository and deter- 
mine what needs to be retrieved. 
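

The comparison a client performs on receiving a metahash can be sketched as below. The representation of the metahash as a mapping from tarball name to hex digest, the use of SHA-1, and the function name are assumptions made for illustration.

```python
import hashlib


def tarballs_to_fetch(metahash, local_tarballs):
    """Return the names of tarballs that are missing or out of date.

    metahash       -- dict: tarball name -> hex digest published by the repository
    local_tarballs -- dict: tarball name -> bytes of the locally stored copy
    """
    stale = []
    for name, published in sorted(metahash.items()):
        data = local_tarballs.get(name)
        if data is None or hashlib.sha1(data).hexdigest() != published:
            stale.append(name)
    return stale
```

The result depends only on the current local state and the latest metahash, which is why a client that missed any number of intermediate notifications still converges after a single comparison.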


Efficient Transfers. The repository makes all of its
files available for download through HTTP. However,
having each client download its files via separate HTTP
connections is prohibitively expensive. The repository
therefore supports different transfer mechanisms for better
scalability, efficiency, and performance. Some transfer
mechanisms (like CoBlitz and Coral) require no special
handling by the repository, while others (like
BitTorrent) do.


To support BitTorrent [9] downloads the reposi- 
tory runs a BitTorrent tracker and a modified version 
of the btlaunchmany daemon provided by BitTorrent. 
The btlaunchmany daemon monitors a directory for any 
new or updated files. When a new file is uploaded to the 
repository it is placed in the monitored directory. When 
the daemon notices the new file it creates a torrent file 
that is later seeded. Unique naming is achieved by 
appending the computed hash of the shared file to the 
name of the torrent. The torrent file is placed in a public 
location on the repository for subsequent download by 
the clients through HTTP. 


Client Tools 


The client tools are used to manage packages in a
client VM and include the stork, pacman, and
stork_receive_update commands. The stork tool uses
command-line arguments to install, update, and remove
packages. Its syntax is similar to apt [2] or yum [36].
The stork tool resolves dependencies and installs addi- 
tional packages as necessary. It also upgrades and 
removes packages. The stork tool downloads the latest 
metadata from package repositories, verifies that pack- 
ages are trusted by the user’s TP file, and only installs 
trusted files. 


Package management with the stork tool is a 
complex process involving multiple steps including 
dependency resolution, trust verification, download, 
and installation. For example, consider the installation
of the foobar package. Assume foobar depends on a few
other packages, such as emacs and glibc, which must be
installed before foobar itself. In order to perform the
installation of foobar, the stork tool must determine whether
foobar, emacs, and glibc are already installed on the client
and, if not, locate candidate versions that satisfy the
dependencies. These steps are similar to those performed
by other package managers [2, 36, 27]. Finally,
Stork ensures that those candidates satisfy the trust
requirements that the user has specified.
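

The dependency step for the foobar example can be sketched as a depth-first walk that lists missing packages with dependencies first. This is a simplification, not Stork’s resolver: it ignores versions and candidate selection, and the function name and data layout are hypothetical.

```python
def install_order(package, depends_on, installed):
    """Return the packages to install for `package`, dependencies first.

    depends_on -- dict: package name -> list of packages it depends on
    installed  -- set of package names already present on the client
    """
    order = []
    in_progress = set()

    def visit(name):
        if name in installed or name in order:
            return
        if name in in_progress:
            raise ValueError("dependency cycle at %s" % name)
        in_progress.add(name)
        for dep in depends_on.get(name, []):
            visit(dep)
        in_progress.discard(name)
        order.append(name)

    visit(package)
    return order
```

If glibc is already installed and foobar depends on emacs and glibc, only emacs and then foobar remain to be located and checked against the trust requirements.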


Figure 2 shows a TP file example. This file
specifically allows emacs-2.2-5.i386.rpm, several versions
of foobar, and customapp-1.0.tar.gz to be installed.
Each package listed in the TP file includes the hash of
the package, and only packages that match the hashes
may be installed. The file also trusts the planetlab-v4 user to know
the validity of any package that user lists (this user has a list
of hashes of all of the Fedora Core 4 packages). It also
trusts the stork user to know the validity of any packages
whose names start with “stork”.


Once satisfactory trusted candidates have been 
found, Stork downloads the packages from the reposi- 
tory and verifies that the packages it downloaded 
match the entries in the TP file, including the secure 
hashes. Finally, the packages themselves are installed. 
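

The check of a downloaded package against FILE entries of a TP file can be sketched as follows. Several points are assumptions: SHA-1 as the hash function, shell-style matching of the PATTERN attribute, and the function names. USER entries (delegated trust) and signature checking of the TP file itself are deliberately left out.

```python
import fnmatch
import hashlib
import xml.etree.ElementTree as ET


def is_trusted(tp_xml, filename, data):
    """Return True if a FILE entry allows this file name and content hash."""
    digest = hashlib.sha1(data).hexdigest()  # hash function is an assumption
    for entry in ET.fromstring(tp_xml).iter("FILE"):
        if (entry.get("ACTION") == "ALLOW"
                and fnmatch.fnmatchcase(filename, entry.get("PATTERN", ""))
                and entry.get("HASH") == digest):
            return True
    return False


def make_tp(filename, data):
    """Build a one-entry TP document for illustration (hypothetical helper)."""
    return ('<TRUSTEDPACKAGES><FILE PATTERN="%s" HASH="%s" ACTION="ALLOW"/>'
            '</TRUSTEDPACKAGES>' % (filename, hashlib.sha1(data).hexdigest()))
```

A package is accepted only when both its name and its contents match an entry, so a tampered download with a trusted name is still rejected.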


Package removal is much less complex than 
installation. Before removing a package, the stork 
command first checks to see if other packages depend 
upon the package to be removed. For RPM packages, 
stork leverages the rpm command and its internal data- 
base to check dependencies. Tar packages do not sup- 
port dependencies at this time and can always be 
removed. If there are dependencies that would be bro- 
ken by removal of the package, then stork reports the 
conflict and exits. Stork removes an installed package 
by deleting the package’s files and running the unin- 
stall scripts for the package. 


The pacman (“package manager”) tool is the
entity in a VM that locally enacts centralized administration
decisions. The pacman tool invokes the appropriate
stork commands based on two configuration files:
groups.pacman (Figure 3) and packages.pacman (Figure
4). The groups.pacman file is optional and defines VM
groups that can be used by an administrator to manage
a set of VMs collectively. The groups.pacman syntax
supports basic set operations such as union, intersection,
complement, and difference. For example, an administrator
for a service may break their VMs into
alpha VMs, beta VMs, and production VMs. This
allows developers to test a new release on alpha VMs
(where there are perhaps only internal users) before
moving it to the beta VMs group (with beta testers)
and finally the production servers.

<GROUPS>
  <GROUP NAME="ALPHA">
    <INCLUDE NAME="planetlab1.arizona.net"/>
    <INCLUDE NAME="planetlab2.arizona.net"/>
  </GROUP>
  <GROUP NAME="ACCOUNTING">
    <INCLUDE NAME="ALPHA"/>
    <INCLUDE NAME="pl1.unm.edu"/>
  </GROUP>
</GROUPS>


Figure 3: Example groups.pacman. The “ALPHA” 
group consists of two machines in Arizona. The 
“ACCOUNTING” group also includes a machine 
at the University of New Mexico. 
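

A groups file of this shape can be expanded into per-group machine sets with a short script. The sketch below is modeled on Figure 3 and handles only INCLUDE, treating a name that matches another group as a nested group; that interpretation, the function name, and the omission of the other set operations are assumptions.

```python
import xml.etree.ElementTree as ET

GROUPS_XML = """<GROUPS>
  <GROUP NAME="ALPHA">
    <INCLUDE NAME="planetlab1.arizona.net"/>
    <INCLUDE NAME="planetlab2.arizona.net"/>
  </GROUP>
  <GROUP NAME="ACCOUNTING">
    <INCLUDE NAME="ALPHA"/>
    <INCLUDE NAME="pl1.unm.edu"/>
  </GROUP>
</GROUPS>"""


def expand_groups(xml_text):
    """Return dict: group name -> set of machine names, expanding nested groups."""
    root = ET.fromstring(xml_text)
    raw = {g.get("NAME"): [i.get("NAME") for i in g.findall("INCLUDE")]
           for g in root.findall("GROUP")}

    def members(name, seen):
        if name in seen:
            raise ValueError("group cycle at %s" % name)
        result = set()
        for item in raw[name]:
            if item in raw:  # the name of another group: include its members
                result |= members(item, seen | {name})
            else:            # otherwise treat the name as a machine
                result.add(item)
        return result

    return {name: members(name, frozenset()) for name in raw}
```

Calling expand_groups(GROUPS_XML) yields two machines for ALPHA and three for ACCOUNTING, matching the caption of Figure 3.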


The packages.pacman file specifies which pack- 
ages should be installed, updated, or removed in the 
current VM based on a combination of VM name, 
group, and physical machine. This makes it easy, for 
example, to specify that a particular package should be 
installed on all VMs on a physical machine, while 
another package should only be installed on alpha 
VMs, etc. 


Although pacman can be run manually, typically
it is run automatically via one of several mechanisms.
First, pacman establishes a connection to the
stork_receive_update daemon. This daemon receives the
repository metahashes that are pushed by the repository
whenever there is an update. Upon receiving this notification,
stork_receive_update alerts pacman to the new




<?xml version="1.0" encoding="ISO-8859-1" standalone="yes" ?>

<TRUSTEDPACKAGES>

<!-- Trust some packages that the user specifically allows -->
<FILE PATTERN="emacs-2.2-5.i386.rpm" HASH="aed4959915ad09a2b02f384d140c4\
626b0eba732" ACTION="ALLOW"/>
<FILE PATTERN="foobar-1.01.i386.rpm" HASH="16b6d22332963d54e0a034c11376a\
2066005c470" ACTION="ALLOW"/>
<FILE PATTERN="foobar-1.0.i386.rpm" HASH="3945fd48567738a28374c3b238473\
09634ee37fd" ACTION="ALLOW"/>
<FILE PATTERN="simple-1.0.tar.gz" HASH="23434850ba2934c39485d293403e3\
293510fd341" ACTION="ALLOW"/>

<!-- Allow access to the planetlab Fedora Core 4 packages -->
<USER PATTERN="*" USERNAME="planetlab-v4" PUBLICKEY="MFwwDQYJKoZIhvcNAQEB\
BQADSwAwSAJBALtGteQPdLa0kYvt+k1lFWTk1H9Y7frYh15JVlhgJa5P1IGI3yK+R22UsD65_J4P\
V92RUgVd_uJMuB8Q4bilw406JMCAWEAAQ" ACTION="ALLOW"/>

<!-- Allowing the ‘stork’ user lets stork packages be installed -->
<USER PATTERN="stork*" USERNAME="stork" PUBLICKEY="MFwwDQYJKoZIhvcNAQEBBQADSwaAw\
SAJBAKgZCjfKD19TSoc1lfBuZsQze6bXtutQYF64TLQ1I9fFgEg2CDyGQVOsZ2CaX1ZEZO69AYZ\
p8nj+YJLIJM3+W3DMCAWEAAQ" ACTION="ALLOW"/>

</TRUSTEDPACKAGES>


Figure 2: Example TP File. This file specifies what packages and users are trusted. Only packages allowed by a TP
file may be installed. FILE actions are used to trust individual packages. USER actions allow hierarchical trust
by specifying a user whose TP file is included. The signature, timestamp, and duration are not shown and are
contained in an XML layer that encapsulates this file.
information. A change to the repository metahash
indicates that the repository contents have changed 
which in turn may change which packages are in- 
stalled, etc. Second, when stork_receive_update is un- 
available pacman wakes up every 5 minutes and polls 
the repository for the repository metahash. As before, 
if there is a discrepancy between the stored data and 
the described data, pacman downloads the updated 
files. Third, pacman also runs when its configuration 
files change. 


The stork_receive_update daemon runs in each 
client VM and keeps the repository’s metahash up-to- 
date. Metadata is received from the repositories using 
both push and pull. Pushing is the preferred method 
because it reduces server load, and is accomplished 
using a multicast tree or publish/subscribe system such 
as PsEPR [5]. Heartbeats are pushed if no new meta- 
hash is available. If stork_receive_update doesn’t receive 
a regular heartbeat it polls the repository and downloads 
new repository metahash if necessary. This download is 
accomplished using an efficient transfer mechanism 
from one of Stork’s transfer modules (discussed further 
in the transfer modules section). This combination of 
push and pull provides an efficient, scalable, fault toler- 
ant way of keeping repository information up-to-date in 
the VMs. 
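
The interplay of push and pull can be sketched as follows. This is an illustrative Python model rather than the stork_receive_update implementation: it assumes that each pushed message (including a heartbeat) carries the current metahash, and the class name, the timeout parameter, and the `poll_repository` callable are all hypothetical.

```python
import time

class MetahashWatcher:
    """Track the repository metahash using pushes, with polling as a fallback."""

    def __init__(self, heartbeat_timeout, poll_repository, clock=time.monotonic):
        self.heartbeat_timeout = heartbeat_timeout  # seconds of silence tolerated
        self.poll_repository = poll_repository      # callable returning the metahash
        self.clock = clock
        self.metahash = None
        self.last_heard = clock()

    def on_push(self, metahash):
        """Handle a pushed metahash or heartbeat; return True if it changed."""
        self.last_heard = self.clock()
        return self._store(metahash)

    def tick(self):
        """Call periodically; polls only when pushes have stopped arriving."""
        if self.clock() - self.last_heard < self.heartbeat_timeout:
            return False
        self.last_heard = self.clock()
        return self._store(self.poll_repository())

    def _store(self, metahash):
        changed = metahash != self.metahash
        self.metahash = metahash
        return changed
```

A return value of True is the point at which a daemon of this kind would alert pacman. Injecting the clock keeps the timeout logic testable without sleeping.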

Nest 

The Stork nest process enables secure file-shar- 
ing between VMs, prevents multiple downloads of the 
same content by different VMs, and maintains up-to- 
date repository metadata. It accomplishes these in two 
ways. First, it operates as a shared cache for its client 
VMs, allowing metadata and packages to be down- 
loaded once and used by many VMs. Second, it per- 
forms package installation on behalf of the VMs, 
securely sharing read-only package files between mul- 
tiple VMs that install the package (discussed further in 
the sharing section). The nest functionality is imple- 
mented by the stork_nest daemon. 


The stork_nest daemon is responsible for main- 
taining connections with its client VMs and processing 
requests that arrive over those connections (typically 
via a socket, although this is configurable). A client 
must first authenticate itself to stork_nest. The authen- 
tication persists for as long as the connection is estab- 
lished. Once authenticated, the daemon then fields 
requests for file transfer and sharing. File transfer
operations use the shared cache feature of the reposi- 
tory to provide cached copies of files to the clients. 
Sharing operations allow the clients to share the con- 
tents of packages using the prepare interface (dis- 
cussed further in the section on prepare modules). 


Typically, the nest runs on each machine that 
runs Stork; however, there may be cases where the 
nest is not run, such as in a desktop machine or a 
server that does not use VMs. In the case where no 
nest is running or the nest process fails, the client tools 
communicate directly with the repository. 

Centralized Management Tools 

The centralized management tools allow Stork 
users to manage their VMs without needing to contact 
the VMs directly. In our example the administrator 
wanted to install foobar automatically on applicable 
systems under her control rather than logging into 
them individually. Unlike the client tools that are run 
in Stork client VMs, the centralized management tools 
are typically run on the user’s desktop machine. They 
are used to create TP files, pacman packages and 
groups files, the master configuration file, public/pri- 
vate keypairs, etc. These files are used by the client 
tools to decide what actions to perform on the VM. In 
addition to managing these files, the centralized man- 
agement tools also upload metadata and/or packages 
to the repository, and assist the user in building pack- 
ages. 

The main tool used for centralized management 
is storkutil, a command-line tool that has many differ- 
ent functions including creating public/private key 
pairs, signing files, extracting metadata from pack- 
ages, and editing trusted packages, pacman packages 
and groups files. Administrators use this tool to create
and modify the files that govern the systems under
their control. While files can be edited by other tools
and then re-signed, storkutil has the advantage of
automatically re-signing updated files. After these
files are updated, they are uploaded to the repository.


Stork on PlanetLab 


Stork currently supports the Vserver environ- 
ment, non-VM machines, and PlanetLab [23, 24]. The 
PlanetLab environment is significantly different from 


<PACKAGES>
<CONFIG SLICE="stork" GROUP="ACCOUNTING">
<INSTALL PACKAGE="foobar" VERSION="2.2"/>
<REMOVE PACKAGE="vi"/>
</CONFIG>
<CONFIG>
<UPDATE PACKAGE="firefox"/>
</CONFIG>
</PACKAGES>


Figure 4: Example packages.pacman. VMs in the slice (a term used to mean a VM on PlanetLab) “stork” and in 
the group “ACCOUNTING” will have foobar 2.2 installed and vi removed. All VMs in this user’s control will 
have firefox installed and kept up-to-date with the newest version. 




the other two, so several extensions to Stork have been 
provided to better support it. 


PlanetLab Overview 


PlanetLab consists of over 750 nodes spread 
around the world that are used for distributed system 
and network research. Each PlanetLab node runs a 
custom kernel that superficially resembles the Vserver 
[18] version of Linux. However, there are many isolation,
performance, and functionality differences.


The common management unit in PlanetLab is 
the slice, which is a collection of VMs on different 
nodes that allow the same user(s) to control them. A 
node typically contains many different VMs from 
many different slices, and slices typically span many 
different nodes. The common PlanetLab (mis)usage of 
the word “slice” means both the collection of simi- 
larly managed VMs and an individual VM. 


Typical usage patterns on PlanetLab consist of an 
authorized user creating a new slice and then adding it 
to one or more nodes. Many slices are used for rela- 
tively short periods of time (a week or two) and then 
removed from nodes (which tears down the VMs on 
those nodes). It is not uncommon for a group that 
wants to run an experiment to create and delete a slice 
that spans hundreds of nodes in the same day. There 
are relatively loose restrictions as to the number of 
nodes slices may use and the types of slices that a 
node may run so it is not uncommon for slices to span 
all PlanetLab nodes. 


Bootstrapping Slices on PlanetLab 


New slices on PlanetLab do not have the Stork 
client tools installed. Since slices are often short-lived 
and span many nodes, requiring the user to log in and 
install the Stork client tools on every node in a slice is 
impractical. Stork makes use of a special initscript to 
automatically install the Stork client tools in a slice. 
The initscript is run whenever the VMM software 
instantiates a VM for the slice on a node. The Stork 
initscript communicates with the nest on the node and 
asks the nest to share the Stork client tools with it. If 
the nest process is not working, the initscript instead 
retrieves the relevant RPMs securely from the Stork 
repository. 

Centralized Management 


Once the Stork client tools are running they need 
the master configuration file and public key for the 
slice. Unfortunately the ssh keys that are used by Plan- 
etLab to control slice access are not visible within the 
slice, so Stork needs to obtain the keys through a dif- 
ferent mechanism. Even if the PlanetLab keys were 
available it is difficult to know which key to use 
because many users may be able to access the same 
VM. Even worse, often a different user may want to 
take control of a slice that was previously managed by 
another user. Stork’s solution is to store the public key 
and master configuration file on the Stork repository. 
The repository uses PlanetLab Central’s API to validate
that users have access to the slices they claim and
stores the files in an area accessible by HTTPS. The client
tools come with the certificate for the Stork repository 
which pacman and stork use to securely download the 
public key and master configuration file for the slice. 
This allows users to change the master configuration 
file or public key on all nodes by simply adding the 
appropriate file to the Stork repository. 


Modularity 


Stork is highly modular and uses several inter- 
faces that allow its functionality to be extended to 
accommodate new protocols and package types: 


Transfer A transfer module implements a trans- 
port protocol. It is responsible for retrieving a particu- 
lar object given the identifier for that object. Transfer 
protocols currently supported by Stork include Co- 
Blitz [21], BitTorrent [9], Coral [12], HTTP, and FTP. 


Share A share module is used by the Stork nest 
to share files between VMs. It protects files from 
modification, maps content between slices, and au- 
thenticates client slices. Currently Stork supports Plan- 
etLab and Linux VServers. Using an extensible inter- 
face allows Stork to be customized to support new 
VM environments. 


Package A package module provides routines 
that the Stork client tools use to install, remove, and 
interact with packages. It understands several package 
formats (RPM, tar) and how to install them in the cur- 
rent system. 


Prepare A prepare module prepares packages 
for sharing. Preparing a package typically involves 
extracting the files from the package. The Prepare 
interface differs from the Package interface in that 
package install scripts are not run and databases (such 
as the RPM database) are not updated. The nest 
process uses the prepare module to ready the package 
files for sharing. 


Transfer Modules 


Transfer modules are used to download files 
from the Stork repository. Transfer modules encapsu- 
late the necessary functionality of a particular transfer 
protocol without having to involve the remainder of 
Stork with the details. 


Each transfer module implements a retrieve_files 
function that takes several parameters including the 
name of the repository, source directory on the reposi- 
tory, a list of files, and a target directory to place the 
files in. The transfer module is responsible for opening 
and managing any connections that it requires to the 
repositories. A successful call to retrieve_files returns a 
list of the files that were successfully retrieved. 

Transfer modules are specified to Stork via an 
ordered list in the main Stork configuration file. Stork 
always starts by trying the first transfer module in the 
list. If this transfer module should fail or return a file
that is old, then Stork moves on to the next module in 
the list. 
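
The ordered fallback can be sketched in a few lines of Python. The parameter order of `retrieve_files` below follows the prose description but is an assumption, as are the helper `retrieve_with_failover` and the stand-in `FakeModule`; the check for stale files is omitted for brevity.

```python
def retrieve_with_failover(modules, repository, source_dir, files, target_dir):
    """Try each transfer module in list order until every file is retrieved.

    Returns (retrieved, remaining), where remaining lists files that no
    module could deliver.
    """
    remaining = list(files)
    retrieved = []
    for module in modules:
        if not remaining:
            break
        try:
            got = module.retrieve_files(repository, source_dir,
                                        remaining, target_dir)
        except OSError:
            continue  # this module failed outright; move on to the next one
        retrieved.extend(got)
        remaining = [f for f in remaining if f not in got]
    return retrieved, remaining


class FakeModule:
    """Stand-in transfer module used only to exercise the sketch."""

    def __init__(self, available, fail=False):
        self.available = set(available)
        self.fail = fail

    def retrieve_files(self, repository, source_dir, files, target_dir):
        if self.fail:
            raise OSError("transfer failed")
        return [f for f in files if f in self.available]


modules = [FakeModule([], fail=True),          # e.g. a CDN that is down
           FakeModule(["a.rpm"]),              # delivers only part of the list
           FakeModule(["a.rpm", "b.rpm"])]     # last resort
got, missing = retrieve_with_failover(modules, "repo", "/packages",
                                      ["a.rpm", "b.rpm"], "/tmp")
```

Only the files still missing are passed to the next module, so a partial success from one module is not repeated by the next.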


Content Retrieval Modules 


CoBlitz uses a content distribution network (CDN)
called CoDeeN [33] to support large file transfers without
modifying the client or server. Each node in the
CDN runs a service that is responsible for splitting large 
files into chunks and reassembling them. This approach 
not only reduces infrastructure and the need for resource 
provisioning between services, but can also improve 
reliability by leveraging the stability of the existing 
CDN. CoBlitz demonstrates that this approach can be 
implemented at low cost, and provides efficient transfers 
even under heavy load. 


Similarly, the Coral module uses a peer-to-peer 
content distribution network that consists of volunteer 
sites that run CoralCDN. The CoralCDN sites auto- 
matically replicate content as a side effect of users 
accessing it. A file is retrieved via CoralCDN simply 
by making a small change to the hostname in an 
object’s URL. Then a peer-to-peer DNS layer trans- 
parently redirects browsers to nearby participating 
cache nodes, which in turn cooperate to minimize load 
on the origin web server. One of the system’s key 
goals is to avoid creating hot spots. It achieves this 
through Coral [12], a latency-optimized hierarchical 
indexing infrastructure based on a novel abstraction 
called a distributed sloppy hash table (DSHT). 


BitTorrent is a protocol for distributing files. It 
identifies content by URL and is designed to integrate 
seamlessly with the web. Its advantage over HTTP is 
that nodes that download the same file simultaneously 
also upload portions of the file to each other. This 
greatly reduces the load on the server and increases 
scalability. Nodes that upload portions of a file are 
called seeds. BitTorrent employs a tracker process to 
track which portions each seed has and helps clients 
locate seeds with the portions they need. BitTorrent 
balances seed loads by having its clients preferentially 
retrieve unpopular portions, thus creating new seeds 
for those portions. 
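
The preference for unpopular portions can be illustrated with a small selection function. This is a simplified sketch of the idea, not BitTorrent's actual piece-selection code; the function name and the shape of the `holders` map are assumptions made for illustration.

```python
def pick_next_piece(needed, holders):
    """Choose the needed piece held by the fewest peers.

    needed:  iterable of piece indexes this client still lacks
    holders: dict mapping piece index -> set of peers that have the piece
    Returns None when no needed piece is currently available.
    """
    available = [p for p in needed if holders.get(p)]
    if not available:
        return None
    # Ties are broken by the lower piece index to keep the choice deterministic.
    return min(available, key=lambda p: (len(holders[p]), p))


holders = {0: {"a", "b", "c"}, 1: {"a"}, 2: {"b", "c"}}
first_choice = pick_next_piece([0, 1, 2], holders)
```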


Stork also supports traditional protocols such as 
HTTP and FTP. These protocols contact the repository 
directly to retrieve the desired data object. It is prefer- 
able to use one of the content distribution networks 
instead of HTTP or FTP as it reduces the repository 
load. 


Stork supports all of these transfer mechanisms 
with performance data presented in the results section. 
One key observation is that although these transfer 
methods are efficient, the uncertainties of the Internet 
make failure a common case. For this reason the trans- 
fer module tries a different transfer mechanism when 
one fails. For example, if a BitTorrent transfer fails,
Stork will attempt CoBlitz, HTTP, or another mechanism
until the transfer succeeds or Stork gives up. This
provides efficiency in the common case, and correct
handling when there is an error.


Nest Transfer 


In addition to the transfer modules listed above, 
Stork supports a nest transfer module. The nest trans- 
fer module provides an additional level of indirection 
so that the client asks the nest to perform the transfer on 
its behalf rather than performing the transfer directly. If 
the nest has a current copy of the requested item in its 
cache, then it can provide the item directly from the 
cache. Otherwise, the nest will invoke a transfer module 
(such as BitTorrent, HTTP, etc.) to retrieve the item, 
which it will then provide to the client and cache for 
later use. 
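
A minimal model of this indirection is shown below. It assumes that a cached copy counts as current when its hash matches the hash the client expects, and it uses SHA-1 purely for illustration; the class `NestCache` and the `fetch` callable (standing in for a transfer module) are hypothetical.

```python
import hashlib

class NestCache:
    """Serve items from a shared cache, fetching each at most once while current."""

    def __init__(self, fetch):
        self.fetch = fetch   # callable(name) -> bytes; stands in for a transfer module
        self.cache = {}      # name -> (hex digest, data)

    def get(self, name, expected_hash):
        entry = self.cache.get(name)
        if entry is not None and entry[0] == expected_hash:
            return entry[1]                  # current copy: serve from the cache
        data = self.fetch(name)              # missing or stale: retrieve again
        self.cache[name] = (hashlib.sha1(data).hexdigest(), data)
        return data


calls = []

def fetch(name):
    calls.append(name)
    return b"payload"

nest = NestCache(fetch)
wanted = hashlib.sha1(b"payload").hexdigest()
first = nest.get("foobar.rpm", wanted)    # triggers a download
second = nest.get("foobar.rpm", wanted)   # answered from the cache
```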


Push 


Stork supports metadata distribution to the nests 
using a publish/subscribe system [11]. In a publish/ 
subscribe system, subscribers register their interest in 
an event and are subsequently notified of events gen- 
erated by publishers. One such publish/subscribe sys- 
tem is PsEPR [5]. The messaging infrastructure for 
PsEPR is built on a collection of off-the-shelf instant 
messaging servers running on PlanetLab. PsEPR publishes
events (XML fragments) on channels to which
clients subscribe. Behind the scenes PsEPR uses over- 
lay routing to route events among subscribers. 


The Stork repository pushes out metadata updates 
through PsEPR. It also pushes out the repository’s meta- 
hash file that contains the hashes of the metadata files; 
this serves as a heartbeat that allows nodes to detect 
missed updates. In this manner nodes only receive meta- 
data changes as necessary and there is no burden on the 
repository from unnecessary polling. 

Directory Synchronization 

In addition to pushing data, Stork also supports a 
mechanism for pulling the current state from a reposi- 
tory. There are several reasons why this might be nec- 
essary, with the most obvious being that the pub- 
lish/subscribe system is unavailable or has not pub- 
lished data in a timely enough manner. Stork builds 
upon the transfer modules to create an interface that 
supports the synchronization of entire directories. 


Directory synchronization mirrors a directory 
hierarchy from the repository to the client. It first 
downloads the repository’s metahash file (the same file 
that the repository publishes periodically using PsEPR). 
This file contains a list of all files that comprise the 
repository’s current state and the hashes for those files. 
Stork compares the hashes to those of the most
recent copies of these files that it has on disk. If a hash
does not match, then the file must be re-downloaded 
using a transfer module. 
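
The comparison step can be sketched as follows. The sketch assumes the metahash has already been parsed into a mapping from file name to hex digest and uses SHA-1 for illustration, since the hash function and file format are not specified here; `file_hash` and `files_to_download` are hypothetical helper names.

```python
import hashlib
import os
import tempfile

def file_hash(path):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def files_to_download(metahash, local_dir):
    """Return names whose local copy is missing or differs from the metahash."""
    stale = []
    for name, expected in sorted(metahash.items()):
        path = os.path.join(local_dir, name)
        if not os.path.isfile(path) or file_hash(path) != expected:
            stale.append(name)
    return stale

# Demonstration: one file is current, one has never been downloaded.
with tempfile.TemporaryDirectory() as local_dir:
    with open(os.path.join(local_dir, "a.meta"), "wb") as f:
        f.write(b"current")
    metahash = {"a.meta": hashlib.sha1(b"current").hexdigest(),
                "b.meta": hashlib.sha1(b"new").hexdigest()}
    stale = files_to_download(metahash, local_dir)
```

The resulting list is what would be handed to a transfer module for retrieval.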


Share Modules

Virtual machines are a double-edged sword: the
isolation they provide can come at the expense of
sharing between them. Sharing is used in conventional
systems to provide performance and resource utilization
improvements. One example is sharing common
application programs and libraries. They are typically 
installed in a common directory and shared by all 
users. Only a single copy of each application and 
library exists on disk and in memory, greatly reducing 
the demand on these resources. Supporting different 
versions of the same software is an issue, however. 
Typically multiple versions cannot be installed in the 
same common directory without conflicts. Users may 
have to resort to installing their own private copies, 
increasing the amount of disk and memory used. 


Stork enables sharing in a VM environment by 
weakening the isolation between VMs to allow file 
sharing under the control of the nest. Specifically, 
read-only files can be shared such that individual 
slices cannot modify the files, although they can be 
unlinked. This reduces disk and memory consumption. 
These benefits are gained by all slices that install the 
same version of a package. It also allows slices to 
install different package versions in the standard loca- 
tion in their file systems without conflict. 


In Stork, sharing is provided via Share modules 
that hide the details of sharing on different VM plat- 
forms. This interface is used by the nest and provides 
five routines: init_client, authenticate_client, share, pro- 
tect, and copy. Init_client is called when a client binds to 
the nest, and initializes the per-client state. Authenti- 
cate_client is used by the nest to authenticate the client 
that has sent a bind request. This is done by mapping a 
randomly named file into the client’s filesystem and 
asking it to modify the file in a particular way. Only a 
legitimate client can modify its local file system, and 
therefore if the client succeeds in modifying the file 
the nest requested, the nest knows that it is talking to a 
legitimate client. The share routine shares (or unshares) 
a file or directory between the client and nest, protect 
protects (or unprotects) a file from modification by the 
client, and copy copies a file between the nest and a 
client. 
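
The authentication handshake can be modelled with ordinary files. In the sketch below a temporary directory stands in for the client's filesystem, and "modify the file in a particular way" is taken to mean writing a nonce chosen by the nest; both are assumptions, and the three function names are hypothetical.

```python
import os
import secrets
import tempfile

def issue_challenge(client_root):
    """Nest side: place a randomly named, empty file in the client's filesystem."""
    path = os.path.join(client_root, "challenge-" + secrets.token_hex(8))
    nonce = secrets.token_hex(16)
    with open(path, "w"):
        pass
    return path, nonce

def answer_challenge(path, nonce):
    """Client side: modify the file as the nest requested."""
    with open(path, "w") as f:
        f.write(nonce)

def verify_challenge(path, nonce):
    """Nest side: accept the client only if the file now holds the nonce."""
    with open(path) as f:
        return f.read() == nonce

with tempfile.TemporaryDirectory() as client_root:
    path, nonce = issue_challenge(client_root)
    before = verify_challenge(path, nonce)   # client has not responded yet
    answer_challenge(path, nonce)
    after = verify_challenge(path, nonce)    # client proved write access
```

The random file name and nonce mean a client cannot prepare the answer in advance; it must be able to write to the filesystem the nest mapped the file into.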


The implementation of the Share module de- 
pends on the underlying platform. On PlanetLab the 
Share module communicates with a component of the 
VMM called Proper [20] to perform its operations. 
The nest runs in an unprivileged slice — all privileged 
operations, such as sharing, copying, and protecting 
files, are done via Proper. 


On the Vserver platform the nest runs in the root 
context, giving it full access to all VM file systems 
and allowing it to do all of its operations directly. Hard 
links are used to share files between VMs. The 
immutable bits are used to protect shared files from 
modification. Directories are shared using mount --bind. 
Copying is easily done because the root context has 
access to all VM filesystems. 


Package Modules 


Stork supports the popular package formats RPM 
and tar. In the future, other package formats such as 
Debian may be added. Each type of package is encapsulated
in a package module. Each package module
implements the following interfaces: 


is_package_understood. Returns true if this pack- 
age module understands the specified package type. 
Stork uses this function to query each package module 
until a suitable match is found. 


get_package_provides. Returns a list of dependen- 
cies that are provided by a package. This function is 
used to generate the metadata that is then used to 
resolve dependencies when installing packages. 


get_packages_requires. Returns a list of packages 
that this package requires. This function is used along 
with get_package_provides to generate the package 
metadata. 


get_package_files. Returns a list of the files that 
are contained in a package. This function is also used 
when generating package metadata. 


get_package_info. Returns the name, version, re- 
lease, and size of a package. This information allows 
the user to install a specific version of a package. 


get_installed_versions. Given the name of a pack- 
age, returns a list of the versions of the package that 
are installed. This function is used to determine when 
a package is already installed, so that an installation 
can be aborted, or an upgrade can be performed if the 
user has requested upgrades. 


execute_transactions. Stork uses a transaction-based 
interface to perform package installation, upgrade, and 
removal. A transaction list is an ordered list of package 
actions. Each action consists of a type (install, upgrade, 
remove) and a package name. 
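
The way these interfaces fit together can be sketched in Python. The classes below are illustrative stand-ins, not Stork's modules: recognition by file suffix is an assumption, and `execute_transactions` only records the ordered actions where a real module would invoke its package tool.

```python
class PackageModule:
    """Minimal stand-in for a package module."""
    label = ""
    suffixes = ()

    def is_package_understood(self, filename):
        # Assumption: the package type is recognised by file suffix.
        return filename.endswith(self.suffixes)

    def execute_transactions(self, transactions):
        # Process the (type, package name) actions strictly in list order.
        return [f"{self.label}: {action} {name}" for action, name in transactions]


class RpmModule(PackageModule):
    label = "rpm"
    suffixes = (".rpm",)


class TarModule(PackageModule):
    label = "tar"
    suffixes = (".tar", ".tar.gz", ".tgz")


def find_module(modules, filename):
    """Query each module in turn until one understands the package."""
    for module in modules:
        if module.is_package_understood(filename):
            return module
    return None


modules = [RpmModule(), TarModule()]
module = find_module(modules, "foobar-1.01.i386.rpm")
log = module.execute_transactions([("install", "foobar"), ("remove", "vi")])
```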

Supported Package Types

stork_rpm. Stork currently supports RPM and tar 
packages. The RPM database is maintained internally 
by the rpm command-line tool, and Stork’s RPM pack- 
age module uses this tool to query the database and to 
execute the install, update, and remove operations.


stork_tar. Tar packages are treated differently be- 
cause Linux does not maintain a database of installed 
tar packages, nor is there a provision in tar packages 
for executing install and uninstall scripts. Stork allows
users to bundle four scripts (.preinstall, .postinstall,
.preremove, and .postremove) that are executed by Stork at the
appropriate times during package installation and removal.
Stork does not currently support dependency resolution
for tar packages, but this would be a straightforward
addition. Stork maintains its own database of the
names and versions of installed tar packages, mimicking
the RPM database provided by the rpm tool.

Nest Package Installation

A special package manager, stork_nest_rpm, is 
responsible for performing shared installation of RPM 
packages. Shared installation of tar packages is not 
supported at this time. Performing a share operation is 
a three-phase process. 
In the first phase, stork_nest_rpm calls stork_rpm
to perform a private installation of the package. This 
allows the package to be installed atomically using the 
protections provided by RPM, including executing any 
install scripts. In the second phase, stork_nest_rpm con- 
tacts the Stork nest and asks it to prepare the package 
for sharing. The prepare module is discussed in the 
following section. Finally, in the third phase stork_ 
nest_rpm contacts the nest and instructs it to share the 
prepared package. The nest uses the applicable share 
module to perform the sharing. The private versions of 
files that were installed by stork_rpm are replaced by 
shared versions. Stork does not attempt to share con- 
figuration files because these files are often changed 
by the client installation. Stork also examines files to 
make sure they are identical prior to replacing a pri- 
vate copy with a shared copy. 


Removal of packages that were installed using 
stork_nest_rpm requires no special processing. stork_ 
nest_rpm merely submits the appropriate remove actions 
to stork_rpm. The stork_rpm module uses the rpm tool to 
uninstall the package, which unlinks the package’s files. 
The link count of the shared files is decremented, but is 
still nonzero. The shared files persist on the nest and in 
any other clients that are linked to them. 
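
On a platform where sharing is implemented with hard links, this behaviour follows directly from filesystem link counts. The short demonstration below uses a temporary directory in place of the nest and client filesystems; the file names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    nest_copy = os.path.join(root, "nest_libfoo.so")
    with open(nest_copy, "wb") as f:
        f.write(b"shared bytes")

    # Share the nest's copy with two client VMs via hard links.
    vm_a = os.path.join(root, "vm_a_libfoo.so")
    vm_b = os.path.join(root, "vm_b_libfoo.so")
    os.link(nest_copy, vm_a)
    os.link(nest_copy, vm_b)
    links_shared = os.stat(nest_copy).st_nlink

    # Uninstalling in one VM unlinks only that VM's name for the file.
    os.unlink(vm_a)
    links_after = os.stat(nest_copy).st_nlink
    with open(vm_b, "rb") as f:
        still_readable = f.read()
```

The data blocks are freed only when the last link is removed, which is why one client's uninstall does not disturb the nest or the other clients.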


Prepare Modules 


Prepare modules are used by the nest to prepare a 
package for sharing. In order to share a package, the 
nest must extract the files in the package. This extrac- 
tion differs from package installation in that no instal- 
lation scripts are run, no databases are updated, and 
the files are not moved to their proper locations. 
Instead, files are extracted to a sharing directory. 


Prepare modules only implement one interface, 
the prepare function. This function takes the name of a 
package and the destination directory in which to 
extract the package. 


RPM is the only package format that Stork cur- 
rently shares. The first step of the stork_rpm_prepare 
module is to see if the package has already been pre- 
pared. If it has, then nothing needs to be done. If the 
package has not been prepared, then stork_rpm_prepare 
uses rpm2cpio to convert the RPM package into a cpio 
archive that is then extracted. stork_rpm_prepare queries 
the rpm tool to determine which files are configuration 
files and moves the configuration files to a special loca- 
tion so they will not be shared. Finally, stork_rpm_pre- 
pare sets the appropriate permissions on the files that it 
has extracted. 
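
The extraction steps can be sketched as follows. The sketch relies on the standard `rpm2cpio` and `cpio` command-line tools; the `.prepared` marker file used to detect earlier work is an assumption, the function names are hypothetical, and the handling of configuration files and permissions is omitted.

```python
import os
import subprocess
import tempfile

def extract_rpm(package, dest):
    """Unpack an RPM payload into dest without running install scripts."""
    rpm2cpio = subprocess.Popen(["rpm2cpio", package], stdout=subprocess.PIPE)
    try:
        # cpio flags: -i extract, -d create directories, -m keep modification times
        subprocess.run(["cpio", "-idm", "--quiet"], stdin=rpm2cpio.stdout,
                       cwd=dest, check=True)
    finally:
        rpm2cpio.stdout.close()
        rpm2cpio.wait()

def prepare(package, dest, extract=extract_rpm):
    """Extract a package into dest once; return False if already prepared."""
    marker = os.path.join(dest, ".prepared")   # hypothetical marker file
    if os.path.exists(marker):
        return False
    os.makedirs(dest, exist_ok=True)
    extract(package, dest)
    with open(marker, "w"):
        pass
    return True

# Demonstration with a stand-in extractor so the sketch runs without rpm installed.
calls = []
with tempfile.TemporaryDirectory() as share_dir:
    dest = os.path.join(share_dir, "foobar-1.01")
    first = prepare("foobar-1.01.i386.rpm", dest,
                    extract=lambda pkg, d: calls.append(pkg))
    second = prepare("foobar-1.01.i386.rpm", dest,
                     extract=lambda pkg, d: calls.append(pkg))
```

Making the extractor a parameter keeps the idempotence check separate from the format-specific work, which mirrors the split between the prepare interface and its RPM implementation.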


Stork Walkthrough 


This section illustrates how the Stork compo- 
nents work together to manage packages using the ear- 
lier example in which an administrator installs an 
updated version of the foobar package on the VMs the 
company uses for testing and on the non-VM desktop 
machines used by the company’s developers. 


1. The administrator uses storkutil to add the new
version of the foobar package to her TP file.

2. She uses storkutil to add the groups Devel and
Test to her groups.pacman file, representing the
developer’s end systems and the testing VMs,
respectively. Since groups can be reused, this
step most likely would have been done previously.

3. The administrator uses storkutil to add a line to
her packages.pacman file instructing the Test
group to update foobar. She does the same for
the Devel group.

4. Storkutil automatically signed these files with
her private key. She now uploads these files to
a Stork repository. If the new version of the foobar
package is not already on the repository she
uploads this as well.

5. The repository treats the TP and pacman files
similarly. The signatures are verified using the
administrator’s public key that is embedded in
the file name. The new files replace the old if
their signatures are valid and their timestamps
newer. The foobar package is stored in a directory
whose name is its secure hash. The package
metadata is extracted and made available
for download.

6. The repository uses the publish/subscribe system
PsEPR to push out a new repository metahash
to the VMs.

7. The VMs are running stork_receive_update and
obtain the new repository metahash. The
stork_receive_update daemon wakes up the pacman
daemon.

8. The pacman daemon updates its metadata. On
non-VM platforms, the files are downloaded
efficiently using whatever transfer method is
listed in the Stork configuration file. On VM
platforms, pacman retrieves the files through the
nest (which means the files are downloaded
only once per physical machine).

9. Pacman processes its metadata and if the current
VM is in either the Test or Devel groups it
calls stork to update the foobar package.

10. The stork tool verifies that it has the current
metadata and configuration files. This is useful
because it is not uncommon for several files to
be uploaded in short succession. If this is not
the case it retrieves the updated files in the
same manner as pacman.

11. Stork verifies that the specified version of foobar
is not already installed; if it is, Stork simply exits.

12. Stork searches the package metadata for the
specified package. If no candidate is found then
it exits with an error message that the package
cannot be found. Multiple candidates may be
returned if the metadata database contains several
versions of foobar.

13. Stork verifies that the user trusts the candidate
versions of foobar. It does this by applying the
rules from the user’s TP file one at a time until
a rule is found that matches each candidate. If
the rule is a DENY rule, then the candidate is
rejected. If the rule is an ACCEPT rule, then
the candidate is deemed trustworthy. The result
of trust verification is an ordered list of package
candidates.

14. Stork now has one or more possible candidates 
for foobar. However, if foobar depends on other 
packages stork repeats steps 13-17 for the de- 
pendencies to determine if those dependencies 
can be satisfied. 

15. Stork now has a list of packages that are to be 
updated, including foobar and its missing depen- 
dencies. Stork uses a transfer module to retrieve 
foobar and dependent packages. The highest pri- 
ority transfer method is to contact the reposi- 
tory, which is via the nest in VM environments. 

16. In a VM environment the nest receives the 
requests for foobar and its dependencies from 
the client VM. If these files are already cached 
on the nest, then the nest provides those local 
copies. If not, then the nest invokes the transfer 
modules (BitTorrent, CoBlitz, etc.) to retrieve 
the files. When retrieval is complete, the nest 
shares the package with the client VM. 

17. Stork now has local copies of foobar and its 
dependent packages. The client queries the 
package modules to find one that can install the 
package. In non-VM environments the stork_ 
rpm module installs the packages using RPM 
and returns to stork which exits. In VM environ- 
ments the stork_nest_rpm module is tried first 
(stork will fail over and use stork_rpm if this 
module fails). Because foobar is an RPM pack- 
age, stork_nest_rpm can process it. Stork builds a 
transaction list and passes it to the execute_transactions
function of stork_nest_rpm.

18. In a VM environment the stork_nest_rpm mod- 
ule passes the transaction list to stork_rpm in 
order to install a private non-shared copy of the 
foobar package. 

19. In a VM environment the stork_nest_rpm module
then contacts the nest and issues a request
to prepare and share foobar. The nest uses the 
appropriate prepare module to extract the files 
contained in foobar. The nest uses the appropri- 
ate share module to share the extracted files 
with the client VM. Sharing overwrites the pri- 
vate versions of the files in the client’s VM 
with shared versions from the foobar package. 


In some cases there will be systems that do not 
receive the PsEPR update. This could occur because 
PsEPR failed to deliver the message or perhaps 
because the system is down. If PsEPR failed, pacman
still checks for updates every five minutes. If the system
was down, pacman will run when it restarts. Either
way pacman will start and obtain a new repository
metahash and the system will continue the process from
Step 8.


If nest or module failures happen, stork fails over 
to other modules that might be able to service the 
request. For example, if the packages cannot be down- 
loaded by BitTorrent, the tool will instead try another 
transfer method like CoBlitz as specified in the master 
configuration file. 
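
The fail-over behavior described above can be sketched as a loop over transfer methods in priority order. This is only an illustration, not Stork's actual code: the function names, the simulated transfer functions, and the file path are all hypothetical, and the real priority order comes from the master configuration file.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical signature: a transfer function takes a package name and
# returns the local path of the downloaded file, or raises on failure.
TransferFn = Callable[[str], str]


def retrieve_with_failover(package: str,
                           methods: Iterable[Tuple[str, TransferFn]]) -> Optional[str]:
    """Try each transfer method in priority order; return the first success."""
    for name, transfer in methods:
        try:
            return transfer(package)
        except Exception as err:  # any failure triggers fail-over
            print(f"{name} failed for {package}: {err}")
    return None  # every configured method failed


def failing_bittorrent(package: str) -> str:
    """Simulated BitTorrent transfer that always fails."""
    raise RuntimeError("no peers available")


def working_coblitz(package: str) -> str:
    """Simulated CoBlitz transfer that always succeeds."""
    return f"/tmp/{package}"


if __name__ == "__main__":
    order = [("bittorrent", failing_bittorrent), ("coblitz", working_coblitz)]
    print(retrieve_with_failover("foobar.rpm", order))
```

Keeping the ordered list of methods outside the loop mirrors the idea that the preference order is configuration rather than code.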


Results 


Stork was evaluated via several experiments on 
PlanetLab. The first measures the effectiveness of 
Stork in conserving disk space when installing pack- 
ages in VM environments. The second experiment 
measures the memory savings Stork provides to pack- 
ages installed in multiple VMs. The final set of exper- 
iments measure the impact Stork has on package 
downloads both in performance and in repository 
load. 


Disk Usage 


The first experiment measured the amount of 
disk space saved by installing packages using Stork 
versus installing them in client slices individually 
(Figure 5). These measurements were collected using 
the 10 most popular packages on a sample of 11 Plan- 
etLab nodes. Some applications consist of two pack- 
ages: one containing the application and one contain- 
ing a library used exclusively by the application. For 
the purpose of this experiment they are treated as a 
single package. 


Package Name | Disk Space (KB): Standard | Disk Space (KB): Stork | Percent Savings
jscriptroute | 8644 | 600 | 93%

Figure 5: Disk Used by Popular Packages. This table
shows the disk space required to install the 10
most popular packages installed by the slices on a 
sampling of PlanetLab nodes. The Standard col- 
umn shows how much per-slice space the package 
consumes if nothing is shared. The Stork column 
shows how much per-slice space the package 
requires when installed by Stork. 

For all but one package, Stork reduced the per-client
disk space required to install a package by over
90%. It should be noted that the nest stores an entire
copy of the package to which the clients link; Stork’s
total space savings is therefore a function of the total
number of clients sharing a package.
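
To make the dependence on the number of clients concrete, the following sketch computes total disk use with and without sharing. It assumes, as a simplification, that the nest's full copy is the same size as one standard per-slice install; the function names are illustrative, and the sample figures are the jscriptroute values from Figure 5.

```python
def standard_total_kb(standard_kb: int, clients: int) -> int:
    """Total space when every client installs its own full copy."""
    return standard_kb * clients


def stork_total_kb(standard_kb: int, stork_kb: int, clients: int) -> int:
    """Total space with sharing: one full nest copy plus per-client overhead.

    Assumption: the nest copy costs the same as one standard install.
    """
    return standard_kb + stork_kb * clients


def total_savings(standard_kb: int, stork_kb: int, clients: int) -> float:
    """Fraction of disk space saved across the whole node (may be negative)."""
    baseline = standard_total_kb(standard_kb, clients)
    return 1.0 - stork_total_kb(standard_kb, stork_kb, clients) / baseline


if __name__ == "__main__":
    # jscriptroute: 8644 KB standard, 600 KB per client under Stork.
    for n in (1, 2, 10, 50):
        print(n, f"{total_savings(8644, 600, n):.1%}")
```

Under this assumption a single client sees no net saving, because the nest copy is extra; the total saving approaches the per-client figure only as more slices share the package.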


One package, j2re, had savings of only 45%. This 
was because many of the files within the package were 
themselves inside of archives. The post-install scripts 
extract these files from the archives. Since the post- 
install scripts are run by the client, the nest cannot 
share the extracted files between slices. By repackag- 
ing the files so that the extracted files are part of the 
package, this issue can be avoided. 


Memory Usage 


Stork also allows processes running in different 
slices to share memory because they share the under- 
lying executables and libraries (Figure 6). The primary 
application was run from each package and its mem- 
ory usage was analyzed. It was not possible to get 
memory sharing numbers directly from the Linux ker- 
nel running on the PlanetLab nodes. Since the Planet- 
Lab kernel shares free memory pages between VMs 
and there are many VMs being used by different users 
on each PlanetLab node, this increases the difficulty of 
gathering accurate memory usage information. 


To obtain approximate results the pmap com- 
mand was used to dump the processes’ address spaces. 
Using the page map data, it is possible to classify
memory regions as shared or private. The results are
only approximate, however, because the amount of 
address space shared does not directly correspond to 
the amount of memory shared as some pages in the 
address space may not be resident in memory. More 
accurate measurements require changes to the Linux 
kernel that are not currently feasible. 
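
A rough sketch of this kind of classification is shown below. It is not the authors' script: the sample lines are invented, the column layout is an assumed pmap-style format (layouts differ between pmap versions), and the rule used here (read-only, file-backed mappings count as shareable; writable or anonymous mappings count as private) is a simplifying assumption.

```python
from typing import Dict

# Invented pmap-style lines: address, size, permissions, mapping.
# The column layout is an assumption and varies between pmap versions.
SAMPLE_PMAP = """\
08048000     88K r-x--  /usr/sbin/named
0805e000      4K rw---  /usr/sbin/named
0805f000   2048K rw---    [ anon ]
b7e00000   1200K r-x--  /lib/libc.so.6
b7f2c000      8K rw---  /lib/libc.so.6
bffeb000     84K rw---    [ stack ]
"""


def classify_regions(pmap_text: str) -> Dict[str, int]:
    """Sum address-space sizes (KB) into shareable and private buckets."""
    totals = {"shared_kb": 0, "private_kb": 0}
    for line in pmap_text.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[1].endswith("K"):
            continue  # skip headers, totals, and malformed lines
        size_kb = int(fields[1][:-1])
        perms, mapping = fields[2], fields[3].strip()
        file_backed = not mapping.startswith("[")  # e.g. not [ anon ] or [ stack ]
        writable = "w" in perms
        key = "shared_kb" if file_backed and not writable else "private_kb"
        totals[key] += size_kb
    return totals


if __name__ == "__main__":
    print(classify_regions(SAMPLE_PMAP))
```

As the text notes, totals of this kind describe address space rather than resident memory, so they are only an approximation of what is actually shared.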


Another difficulty in measuring memory use is 
that it changes as the program runs. Daemon programs 
were simply started and measured. Applications that
process input files (such as java and make) were started 
with a minimal file that goes into an infinite loop. The 
remaining applications printed their usage information 
and were measured before they exited. 


The resulting measurements show that Stork typically
reduces the memory required by additional processes
by 50% to 60%. There are two notable exceptions:
named and java. These programs allocate huge
data areas that are much larger than their text segments
and libraries. Data segments are private, so this
overshadows any benefits Stork provides in sharing text
and libraries.


Package Retrieval 


Stork downloads packages to the nest efficiently, 
in terms of the amount of network bandwidth required, 
server load, and elapsed time. This was measured by 
retrieving a 10 MB package simultaneously from 300 
nodes (Figure 7), simulating what happens when a 
new package is stored on the repository. Obviously 
faulty nodes were not included in the experiments, and 
a new randomly-generated 10 MB file was used for 
each test. Each test was run three times and the results 
averaged. It proved impossible to get all 300 nodes to 
complete the tests successfully; in some cases some 
nodes never even started the test. Faulty and unrespon- 
sive nodes are not unusual on PlanetLab. This is dealt 
with by simply reporting the number of nodes that 
started and completed each test. 


Figure 6: Memory Used by Popular Packages. Packages installed by Stork allow slices to share process memory.
The Standard column shows how much memory is consumed by each process when nothing is shared. With
Stork the first process will consume the same amount as the Standard column, but additional processes only
require the amount shown in the Stork column.

Repository load, which is important to system scalability,
is measured as the total amount of network traffic
generated by the repository. This includes retransmissions,
protocol headers, and any other data. For BitTorrent,
this includes the traffic for both the tracker
and the initial seed as they were run on the same node;
running them on different nodes made negligible difference.
At a minimum the repository must send 10
MB, since the clients are downloading a 10 MB file.
CoBlitz generated the least network traffic, sending
7.8 times the minimum. BitTorrent sent 3.3 times as
much data as CoBlitz and Coral sent 5.5 times as
much as CoBlitz. HTTP was by far the worst, sending
39.5 times as much as CoBlitz. In fact, HTTP exceeded
the product of the number of clients and the file size
because of protocol headers and retransmissions.
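
The ratios above can be turned into approximate absolute repository loads. The sketch below derives the values purely from the multipliers quoted in the text (not from the measured data in Figure 7), so the results are rounded estimates; the variable and function names are illustrative.

```python
FILE_MB = 10.0   # size of the test file
CLIENTS = 300    # nominal number of downloading nodes


def repository_load_mb() -> dict:
    """Approximate MB sent by the repository, derived from the quoted ratios."""
    coblitz = 7.8 * FILE_MB          # 7.8 times the 10 MB minimum
    return {
        "CoBlitz": coblitz,
        "BitTorrent": 3.3 * coblitz,  # 3.3 times CoBlitz
        "Coral": 5.5 * coblitz,       # 5.5 times CoBlitz
        "HTTP": 39.5 * coblitz,       # 39.5 times CoBlitz
    }


def reduction_vs_http(protocol: str) -> float:
    """Fraction by which a protocol reduces repository load relative to HTTP."""
    load = repository_load_mb()
    return 1.0 - load[protocol] / load["HTTP"]


if __name__ == "__main__":
    for name, mb in repository_load_mb().items():
        print(f"{name:10s} {mb:8.1f} MB")
    print(f"clients x file size = {CLIENTS * FILE_MB:.0f} MB")
    print(f"BitTorrent reduction vs HTTP: {reduction_vs_http('BitTorrent'):.1%}")
```

Under these derived figures the HTTP load comes out slightly above the 3000 MB product of client count and file size, which agrees with the statement that HTTP exceeded that product.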


For each test the amount of useful bandwidth 
each client received (file data exclusive of network 
protocol headers) is reported, including both the me- 
dian and mean, as well as the 25th and 75th per- 
centiles. BitTorrent’s mean bandwidth is 2.8 times that 
of CoBlitz, 3.3 times that of HTTP, and 4.2 times that 
of Coral. HTTP does surprisingly well, which is a 
result of a relatively high-speed connection from the 
repository to the PlanetLab nodes. 


Figure 8 shows the cumulative distribution of 
client completion times. More than 78% of the nodes 
completed the transfer within 90 seconds using Bit- 
Torrent, compared to only 40% of the CoBlitz and 
23% of the Coral nodes. None of the HTTP nodes fin- 
ished within 90 seconds. 


The distribution of client completion times also 
varied greatly among the protocols. The time of HTTP 
varied little between the nodes: there is only an 18% 
difference between the completion time of the 25th 
and 75th percentiles. The BitTorrent clients in the 25th 
percentile finished in 48% of the time of clients in the
75th percentile, while Coral clients differed by 64%.
CoBlitz had the highest variance, so that the clients in
the 25th percentile finished in 14% of the time of the
clients in the 75th percentile, meaning that the slowest
nodes took 7.3 times as long to download the file as 
the fastest. 


These results reflect how the different protocols 
download the file. All the nodes begin retrieving the 
file at the same time. Clients in BitTorrent favor down- 
loading rare portions of the file first, which leads to 
most of the nodes downloading from each other, rather 
than from the repository. The CoBlitz and Coral CDN 
nodes download pieces of the file sequentially. This 
causes the clients to progress in lock-step through the file,
all waiting for the CDN node with the next piece of the
file. This places the current CDN node under a heavy
load while the other CDN nodes are idle.
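
The contrast between the two download orders can be sketched as two piece-selection functions. This is a simplified illustration rather than the behavior of any specific client: real BitTorrent implementations add randomization and other policies, and the function names and peer data below are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Set


def pick_rarest_piece(have: Set[int], peers: Dict[str, Set[int]]) -> Optional[int]:
    """Return the missing piece held by the fewest peers (ties: lowest index)."""
    counts: Counter = Counter()
    for pieces in peers.values():
        counts.update(pieces - have)  # count only pieces we still need
    if not counts:
        return None  # nothing left that any peer can offer
    return min(counts, key=lambda piece: (counts[piece], piece))


def pick_sequential_piece(have: Set[int], total_pieces: int) -> Optional[int]:
    """Return the lowest-numbered missing piece, as in a sequential download."""
    for piece in range(total_pieces):
        if piece not in have:
            return piece
    return None


if __name__ == "__main__":
    peers = {"a": {0, 1, 2}, "b": {1, 2}, "c": {1, 3}}
    print(pick_rarest_piece({0}, peers))   # piece 3 is held by only one peer
    print(pick_sequential_piece({0}, 4))   # piece 1 is simply next in order
```

With rarest-first selection, different clients tend to request different pieces and can then exchange them with each other; with sequential selection, all clients request the same next piece at roughly the same time.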


[Plot: Percent Nodes Completed vs. Time (sec); series: BitTorrent]

Figure 8: Elapsed Time. This graph shows the cumulative
distribution of client completion times.
Only nodes that successfully completed are included.


Based on these results Stork uses BitTorrent as
its first choice when performing package retrievals,
switching to other protocols if it fails. BitTorrent
decreased the transfer time by 70% relative to HTTP
and reduced the amount of data that the repository
needs to send by 92%.


Related Work 


Prior work to address the problem of software 
management can be roughly classified into three cate- 
gories: (a) traditional package management systems 
which resolve package dependencies and retrieve pack- 
ages from remote systems, (b) techniques to reduce the 
cost of duplicate installs, and (c) distributed file systems 
that are used for software distribution. 


Traditional Package Management Systems 


Figure 7: Package Download Performance. This table shows the results of downloading a 10 MB file to 300
nodes. Each result is the average of three tests. The client bandwidth is measured with respect to the amount of
file data received, and the mean, median, 25th percentile, and 75th percentile results given. The Nodes Completed
column shows the number of nodes that started and finished the transfer. The Server MB Sent is the
amount of network traffic sent to the clients, including protocol headers and retransmissions.

Popular package management systems [2, 10, 27,
34, 36] typically retrieve packages via HTTP or FTP,
resolve dependencies, and manage packages on the
local system. They do not manage packages across 
multiple machines. This leads to inefficiencies in a 
distributed VM environment because a service spans 
multiple physical machines, and each physical ma- 
chine has multiple VMs. The package management 
system must span nodes and VMs, otherwise VMs 
will individually download and install packages, con- 
suming excessive network bandwidth and disk space. 


Most package management systems have support 
for security. In general, however, the repository is 
trusted to contain valid packages. RPM and Debian 
packages can be signed by the developer and the signa- 
ture is verified before the package is installed. This 
requires the user to have the keys of all developers. In 
many cases package signatures are not checked by 
default because of this difficulty. The trustedpackages 
file mechanism in Stork effectively allows multiple sig- 
natures per package so that users require fewer keys. 


Reducing the Cost of Duplication 


Most VMMs focus on providing isolation between
VMs, not sharing. However, different techniques
have been devised to mitigate the disk, memory, and
network costs of installing duplicate packages.


Disk. A good deal of research has gone into preventing
duplicate data from consuming additional disk
space. For example, many file systems use copy-on- 
write techniques [6, 8, 14, 15, 16, 30] which allow data 
to be shared but copied if modified. This allows differ- 
ent “snapshots” of a file system to be taken where the 
unchanged areas will be shared amongst the “snap- 
shots”. However, this does not combine identical files 
that were written at different locations (as would happen 
with multiple VMs downloading the same package). 


Some filesystem tools [4] and VMMs [17, 18] 
share files that have already been created on a system. 
They unify common files or blocks to reduce the disk 
space required. This unification happens after the 
package has been installed; each VM must download 
and install the package, only to have its copies of the 
files subsequently replaced with links. Stork avoids 
this overhead and complexity by linking the files in 
the first place. 


Another technique for reducing the amount of 
storage space consumed by identical components de- 
tects duplicate files and combines them as they are 
written [25]. This is typically done by using a hash of 
the file blocks to quickly detect duplicates. Stork 
avoids the overhead of needing to check file blocks for 
duplicates on insertion and avoids the need to down- 
load the block multiple times in the first place. 
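
For comparison, hash-based duplicate detection at write time can be sketched as a small content-addressed block store. This illustrates the general technique only, not the system cited as [25]; the class name, the 4096-byte default block size, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from typing import Dict, List


class DedupStore:
    """Toy block store that keeps one copy of each distinct block."""

    def __init__(self, block_size: int = 4096) -> None:
        self.block_size = block_size
        self.blocks: Dict[str, bytes] = {}     # digest -> block contents
        self.files: Dict[str, List[str]] = {}  # file name -> list of digests

    def write(self, name: str, data: bytes) -> None:
        """Split data into blocks and store only blocks not seen before."""
        digests = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # duplicate blocks are skipped
            digests.append(digest)
        self.files[name] = digests

    def read(self, name: str) -> bytes:
        """Reassemble a file from its block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        """Bytes actually kept after deduplication."""
        return sum(len(block) for block in self.blocks.values())


if __name__ == "__main__":
    store = DedupStore(block_size=4)
    store.write("vm1/libfoo", b"aaaabbbbcccc")
    store.write("vm2/libfoo", b"aaaabbbbcccc")
    print(store.stored_bytes())
```

Every write still has to hash each block, and each VM still has to obtain the data before it can be written, which is the overhead the paragraph above says Stork avoids.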


Memory. There are many proposals that try to
reduce the memory overhead of duplicate memory 
pages. Disco [6] implements copy-on-write memory 
sharing between VMs which allows not only a pro- 
cess’ memory pages to be shared but also allows 
duplicate buffer cache pages to be shared. The sharing 
provided by Stork is much less effective than that of Disco,
but comes at a much lower cost.


Stork allows VMs to share the memory used by 
shared applications and libraries. VMware ESX Server 
[32] also allows VMs to share memory, but does so 
based on page content. A background process scans 
memory looking for multiple copies of the same page. 
Any redundant copies are eliminated by replacing 
them with a single copy-on-write page. This allows for 
more potential sharing than Stork, as any identical 
pages can be shared, but at the cost of having pro- 
cesses create duplicate pages only to have them culled. 


Network Bandwidth. A common technique to
mitigate the network costs of duplicate data retrieval is 
to use a proxy server [7, 26, 28, 35]. Proxy servers 
minimize the load on the server providing the data and 
also increase the performance of the clients. However, 
the data still must be transferred multiple times over the
network, while the Stork nest provides the data to the 
client VMs without incurring network traffic (much like 
each system running its own proxy server for packages). 
Stork uses techniques such as P2P file dissemination [9] 
along with proxy based content retrieval [22, 12] to 
minimize repository load. 


Distributed File Systems 


Stork uses content distribution mechanisms to 
download packages to nodes. Alternatively, a distributed 
file system such as NFS could be used. For example, the 
relevant software package files could be copied onto a 
file system that is shared via NFS. There are many 
drawbacks to this technique including poor performance 
and the difficulty in supporting different (and existing) 
packages on separate machines. 


Among the numerous distributed file systems,
Shark [1] and SFS-RO [13] are two that have been
promoted as a way to distribute software. Clients can 
either mount applications and libraries directly, or use 
the file system to access packages that are installed 
locally. The former has performance, reliability, and 
conflict issues; the latter only uses the distributed file 
system to download packages, which may not be supe- 
rior to using an efficient content distribution mecha- 
nism and does not provide centralized control and 
management. 


Conclusion 


Stork provides both efficient inter-VM package
sharing and centralized inter-machine package management.
When sharing packages between VMs it typically
provides over an order of magnitude in disk savings
and reduces memory costs by about 50%. Additionally,
each node needs to download a package only once
no matter how many VMs install it. This reduces the
package transfer time by 70% and reduces the repository
load by 92%.


Stork allows groups of VMs to be centrally 
administered. The pacman tool and its configuration 
files allow administrators to define groups of VMs and
specify which packages are to be installed on which 
groups. Changes are pushed to the VMs in a timely 
fashion, and packages are downloaded to the VMs 
efficiently. Stork has been in use on PlanetLab for 
over four years and has managed thousands of virtual 
machines. The source code for Stork may be down- 
loaded from http://www.cs.arizona.edu/stork 


ABSTRACT


Management of virtual machines (VMs) on a large scale remains a significant challenge 
today. We lack general and vendor-independent quantitative criteria/metrics by which to describe 
the state of infrastructures for virtualization. This data is essential both for expressing administrative
policy goals and for measuring ongoing compliance with them in production settings.


In this work, we consider VM management in production environments. We investigate VM 
applicability to real-world scientific computing problems by comparing application performance 
in a controlled environment of physical servers with the Manage Large Networks (MLN) tool’s 
implementation of the same scenario on virtual machines. Based on our observations, we propose 
three new metrics by which to describe and analyze the infrastructure in order to integrate
virtual machine management more closely with policy. The metrics have been implemented in the latest
release of our MLN tool for virtual machine management.


Introduction 


Most approaches to virtualization devote the
bulk of their effort to features related to deployment
and fail-over mechanisms. Indeed, much of the
work in this area focuses on these features exclusively.
However, after running a virtual infrastructure in pro- 
duction, one discovers how deployment decisions and 
virtual machine behavior affect the overall perfor- 
mance in ways which are not completely described 
nor obviously transparent. One key challenge is avoid- 
ing resource conflicts among the virtual machines. 
This task depends on many local factors, such as the 
number and behavior of the virtual machines, as well 
as the number and capacity of physical servers and 
other infrastructure resources like common storage. 
We use the term “virtual(ized) infrastructure” for a
general scenario where more than one physical server
is used to host a range of virtual machines with varying
life-spans and purposes. We do not intend to refer
to a particular virtual machine management framework
or product.


The system administrator needs methods to ana- 
lyze and describe the site’s virtualization infrastructure 
in a technology independent way. Such analysis would 
assist in the following tasks: 

• Determine the level of redundancy in server
  capacity for downtime planning.
• Review the level of resource conflicts between
  virtual machines in order to identify and
  remove bottlenecks.
• Find the optimal server location for new virtual
  machines in the infrastructure.
• Identify which virtual machines are apt to
  demand resources at the same time in order to
  separate them from each other.


Possessing this and related data will also enable 
the system administrator to express and implement 
explicit, quantitative policy rules for the site, enabling 
her to address a variety of concerns: How should one 
describe the desired level of redundancy in server 
capacity and measure its compliance? Could a level of 
resource conflicts function as an indication of how 
well the virtual machines are deployed across the site? 
Such metrics are not only valuable for system administrators
today, but also a necessary step towards autonomic capabilities
in future management tools, since self-optimizing
system behavior techniques require quantifiable metrics
in order to measure success.


MLN (Manage Large Networks) [1] is a virtual 
machine management tool developed at the University 
College of Oslo. It supports both Xen and User-Mode 
Linux. After using MLN in various scenarios and 
addressing the need to efficiently deploy large scenar- 
ios of virtual machines [2, 3, 4], we have arrived at the 
point where long-time management reveals new chal- 
lenges, problems which are still unexplored by the 
community. This article addresses these challenges by 
proposing three methods by which to analyze a virtu- 
alized infrastructure. The Server redundancy level, 
Resource conflict matrix and Location conflict table 
are technology independent measures of the state of 
the infrastructure. 


This text is organized as follows: We first dem- 
onstrate the viability of using Xen virtual machines 
and MLN in a production environment. This case 
study demonstrated to us the challenges of long-time 
management of virtual machines. Next, our approaches 
for analysis are presented and discussed in detail. We 
then discuss the implementation of these in MLN and 
discuss our findings to date. Finally, we outline future
work and review related work. 


VM Viability in Production Scientific Computing 
Environments 


In recent years, there have been many calls for the 
increased use of virtualization technologies for high 
performance computing (HPC) and especially scientific 
modeling and simulations (see, e.g., [5, 6]). The many 
advantages of virtualization and virtual machines (VMs) 
apply to such specialized computing environments
just as they do to general purpose systems (e.g., the
ability to control resource usage, security barriers, per- 
formance and reliability improvements due to isolation 
of distinct processes, and so on). 


The computing environment and typical job 
characteristics of scientific computing environments, 
whether located in academic institutions, corporate 
research and development or government laboratories, 
share many common traits which distinguish them 
from more general computing environments: 

• Rather than a mix of many jobs of various
  types and needs, scientific computing typically
  consists of a few computations of significant
  duration. The most intensive of scientific computing
  applications have essentially unbounded
  need for CPU cycles, physical memory and
  memory bandwidth, and, in some cases, disk
  and/or network I/O capacity.

• Research efforts tend to rely on a few software
  packages related to the field under investigation.
  Most production software is commercial
  for which source code is not available (and the
  expertise required to modify it is not present at
  most sites).

• Research groups prefer to have their own computing
  resources, with typically little or no system
  administrative support from any centralized
  IT organization.


Such environments face many challenges which 
virtualization technology, in conjunction with a tool 
like MLN, can address: 

• Deploying idle computing resources, where and
  when they are available. This can include applying
  general purpose computers to simulation
  problems after hours. VMs allow the hardware to be
  used for scientific computing while the host operating
  system environment remains unaffected.

• Since scientific computations can execute for
  hours, days or even weeks, being able to start,
  pause, restart and migrate such jobs is very
  beneficial. VMs and MLN provide this capability
  to any application which employs their resources,
  often adding this valuable feature for the
  first time. For example, Gaussian 03, the production
  code we employed in performance tests, has
  only limited job restarting capabilities. MLN


Begnum, Disney, Frisch, & Mevag 


  allows jobs to be paused at any point. In the
  past, such capabilities were present only in special
  purpose checkpointing libraries [7, 8].

• Multiple operating system environments (even
  legacy ones) can be maintained and invoked
  on the same hardware. This is important in this
  arena in that distinct software packages typically
  have limited, and often contradictory, operating
  system version support (often lagging well behind
  the current releases). MLN allows distinct
  OS environments to be set up and cloned easily,
  ready to be reused as often and for as long as
  needed.


Despite these very real benefits, however, perfor- 
mance is still the most important consideration in sci- 
entific computing, so virtualization will need to come 
with few associated resource costs in order to be 
adopted. Previous work has focused on the perfor- 
mance achieved on standard benchmarks [9, 10, 11, 
12]. While this data is interesting, and in general 
encouraging with respect to VM use for HPC, the lim- 
itations of benchmarks in modeling actual scientific 
computing are well known. First, such benchmarks 
tend to focus on single performance metrics in isola- 
tion: CPU speed, memory bandwidth, disk I/O, net- 
work message passing, and so on. Secondly, the algo- 
rithms employed in the computational portions are 
among the simplest of those used in real applications 
(e.g., Linpack). In addition, the results of the more
complex, higher level benchmarks (e.g., the SPEC
suite, the GCM program used in [10], and the like)
are reduced to a single metric, megaflops
achieved, which has been repeatedly shown to bear
only a vaguely proportionate relationship to actual
production performance. Finally, and often most im- 
portantly, the problem size is far too small to be useful 
or representative of actual computations and comput- 
ing requirements. 


Accordingly, we chose to conduct some tests 
with actual production code on modest sized but real- 
istic problems. We used the Gaussian 03 computa- 
tional chemistry package. Gaussian 03 performs elec- 
tronic structure calculations, modeling the properties 
of chemical compounds and reactions. It is widely 
used by academic and industrial chemists, chemical 
engineers, biochemists, physicists, and materials sci- 
entists throughout the world, addressing one of the key 
Grand Challenge level problems. Computationally, it 
is a very demanding and rigorous application whose 
achieved performance depends on the combination of 
CPU performance, memory bandwidth and, for some 
simulations, disk I/O transfer rates. 


We ran three jobs of increasing CPU require- 
ments in several ways: under a standard Red Hat 
Enterprise Linux kernel, directly under a Xen version 
of that operating system, and in a Xen virtual machine 
(details, which are probably understandable only to 
chemists, are given in a separate section below). We 


96 21st Large Installation System Administration Conference (LISA ’07) 




also ran the jobs on a single node and in a parallel 
mode using three nodes; the parallel computations 
were performed under the Xen-enabled operating sys- 
tem and in VMs. In this way, we could test the perfor- 
mance effects of both aspects of virtualization: the 
modified Linux kernel and using a VM itself. The three 
nodes chosen for the parallel computations were quite 
different in performance characteristics and available 
resources. This selection was made to model the perfor- 
mance that might be obtained from drawing together 
idle systems in an ad hoc manner after hours. However, 
parallel performance is known to be best when using 
symmetric nodes; this approach thus represents a worst- 
case scenario in that the performance of the more pow- 
erful systems is reduced to that of the weakest node. 

The following table gives the parallel speedups 
obtained in the two environments (comparisons are 
with respect to a single node of the same type): 


Parallel Speedups (1 vs. 3 Nodes)

Job | Xen | VM
1 | 2.1 | 2.1
2 | 2.8 | 2.8
3 | 1.2 | 1.3

Table 1: Parallel speedup comparison.





Job 1, the shortest job, obtains reasonable parallel
speedups, and Job 2 does quite well (as the maximum
speedup is 3.0); both of these jobs have few I/O
requirements. This is in contrast to Job 3, chosen
because of its substantial I/O requirements. Even in this
case, however, parallelization provides some performance
benefits. For our purposes, however, the key
result in the preceding table is that using virtual
machines for the computations produced parallel speedups
essentially identical to those of the jobs run directly on the hardware.
Using a VM had no adverse effect on parallelization
efficiency.
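
One way to read Table 1 is in terms of parallel efficiency, that is, the achieved speedup divided by the number of nodes. The sketch below applies this standard definition to the table's values; the efficiency figures are derived here for illustration and are not reported in the paper, and the names are illustrative.

```python
NODES = 3  # the parallel runs used three nodes

# Speedups from Table 1, keyed by job number and environment.
SPEEDUPS = {
    1: {"Xen": 2.1, "VM": 2.1},
    2: {"Xen": 2.8, "VM": 2.8},
    3: {"Xen": 1.2, "VM": 1.3},
}


def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup / nodes


def efficiency_table() -> dict:
    """Efficiency for every job and environment, rounded to two decimals."""
    return {
        job: {env: round(parallel_efficiency(s, NODES), 2) for env, s in envs.items()}
        for job, envs in SPEEDUPS.items()
    }


if __name__ == "__main__":
    for job, envs in efficiency_table().items():
        print(job, envs)
```

Seen this way, Job 2 achieves roughly 93% of the ideal speedup in both environments, while the I/O-heavy Job 3 achieves well under half.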


The following table explores the overhead asso- 
ciated with virtualization — and specifically the Xen 
approach — in more detail: 


Virtualization Overhead

Job | Xen over RHEL | VM over RHEL
1 | 3.8% | 0.6%
2 | 3.9% | 0.8%
3 |  | 7.9%

Table 2: Virtual overhead comparison.


The columns compare the performance for single 
processor jobs in the three environments: directly on 
the hardware with a standard kernel (RHEL), directly 
on the hardware with a Xen-enabled kernel (Xen) and 
in a Xen VM. The table indicates that when disk I/O is 
not a factor, there is only a quite small performance
penalty associated with the Xen operating system
and virtually none associated with using a VM,
both compared to the vanilla Linux OS. Interestingly,
the Xen kernel itself is slightly less efficient than the
vanilla operating system running in a VM.


When disk I/O is a significant factor, the results 
are somewhat different. The overhead associated with 
the Xen-enabled OS remains more-or-less constant, 
but there is additional overhead associated with running in a VM
(at least 4.8% for this job). The total overhead of
about 8% is still quite acceptable, but reducing that 
level would be desirable. Our initial tests focused on 
making virtualization configuration and use as easy as 
possible. Thus, the VM was built on a logical volume. 
In the jobs running directly under the Xen-enabled 
OS, however, disk I/O associated with the computa- 
tion was to a logical volume as well, and the LVM 
undoubtedly imposed some overhead. 


Future work will look at minimizing I/O over- 
head in the VM environment while still retaining VM 
configuration and build simplicity. However, these 
results show clearly that performance in an MLN- 
managed virtual environment is quite acceptable for 
production Gaussian 03 calculations. 


New Challenges and Approaches 


Our infrastructure consisted of a total of 11 
servers spread over two separate locations. Although 
the performance levels of Xen were acceptable, we 
encountered management issues which revealed new 
challenges: 

• Previous work has shown that there is a substantial
  performance degradation if two virtual
  machines compete for the same resources on
  the same server [3]. We encountered difficulties
  avoiding such conflicts due to lack of overview
  over the entire infrastructure.

• At one point, one of the servers became unstable
  and crashed. This taught us a valuable lesson
  because we ended up with insufficient
  capacity to host all of our virtual machines until
  the problem was fixed. We believe that a way
  to plan ahead for these eventualities would
  avoid the same situation in the future.

• There was no decision support from MLN to
  find the optimal placement for a new virtual
  machine project. We had to manually inspect
  the other running projects to arrive at a solution.


We believe that these challenges are universal for 
all system administrators who have to manage suffi- 
ciently many servers and virtual machines. Let us first 
describe a virtualized infrastructure in a more general- 
ized way. 

The foundation of a virtualized infrastructure is 
its physical servers, for which we will use the term 
server, and the virtual machines (VMs). We make two 
basic assumptions: 

1. The infrastructure consists of more than one 
physical server. 
2. The infrastructure enables the system administrator 
to move virtual machines between the physical 
servers. 


What is left is to find the optimal placement of 
the virtual machines such that the load is evenly 
distributed and all the VMs experience a satisfactory 
level of performance. This process is highly dependent 
on the organization’s context. The resulting 
provisioning policy will act as a local guidebook for 
decision making, or as a policy by which an expert 
system may monitor and tune the infrastructure. 


It is essentially the system administrator’s duty 
to re-provision virtual machines for the best 
performance. This manual process depends on information 
about the infrastructure and the individual virtual 
machines. In some cases, knowledge about the virtual 
machine can help the decision process. Is it planned to 
be a web service or a shell server? Can we expect 
performance peaks at specific times? Over what time 
period will the virtual machine run? This information 
may be hidden from the administrator if users can 
create their own virtual machines independently. What 
remains is the static hardware description of the 
virtual machine’s resource consumption and its 
performance profile. We will take this perspective in 
this text, assuming that we have no prior knowledge 
about virtual machine roles. We will, however, assume 
that the administrator is at liberty to re-provision 
the virtual machines to different physical servers. 


Virtual Machine Resources 


A virtual machine can be described in two 
complementary ways. The first is through its static 
attributes, which are defined in the virtual machine’s 
design. Examples are the amount of memory, the number 
of CPUs and the placement of its filesystem locally or 
on a SAN. These design decisions influence the virtual 
machine’s performance and are for the most part static 
variables. Once the virtual machine is running, we get 
a series of dynamic variables describing how the 
virtual machine performs over time. We get the CPU 
consumption, network traffic, IO operations and 
process interrupts, to name a few. 


It is easy to plan ahead for an even distribution 
of the static resources. Memory usually dominates 
these decisions, since there is only so much to go 
around. The CPU, on the other hand, can be shared, 
and on a multi-CPU server, virtual machines may even 
run alongside each other without conflict. Too many 
virtual machines demanding CPU time at once will 
throttle the performance of all. To avoid this, we 
need to know which virtual machines are the 
troublemakers so that we can isolate them. 
For small sites with a manageable number of virtual 
machines, intuition and random observation may be 
sufficient for a satisfactory end result. But what 
happens when the number of servers and virtual 
machines exceeds what is practical for manual 
analysis? On larger sites, the dynamic resources need 
to be observed as time-series variables and stored 
for analysis. 

• Static resources belong to the server and are 
allocated for each VM at both boot and build 
time. Examples of static resources are: 

  o Disk size 
  o Filesystem placement 
  o Memory 
  o Virtual CPUs 

• Dynamic resources are consumed by the VM 
at run-time. They are subject to rapid change. 
Examples of dynamic resources are: 

  o CPU seconds 
  o Network traffic 
  o IO operations 
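
The distinction between static and dynamic resources can be sketched as a small data model. This is an illustration only and not MLN's internal representation; all class and field names are hypothetical. Static attributes are fixed when the VM is built, while dynamic values are appended as time-stamped samples for later analysis.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class StaticResources:
    """Attributes fixed at build/boot time (hypothetical field names)."""
    memory_mb: int
    disk_gb: int
    vcpus: int
    filesystem: str  # e.g. "local" or "san"


@dataclass(frozen=True)
class DynamicSample:
    """One observation of run-time consumption (hypothetical field names)."""
    timestamp: float
    cpu_seconds: float
    net_bytes: int
    io_ops: int


@dataclass
class VirtualMachine:
    name: str
    static: StaticResources
    samples: List[DynamicSample] = field(default_factory=list)

    def record(self, sample: DynamicSample) -> None:
        """Store a sample so the VM's behavior can be analyzed as a time series."""
        self.samples.append(sample)


vm = VirtualMachine("VM1", StaticResources(512, 30, 1, "local"))
vm.record(DynamicSample(0.0, 1.5, 2048, 12))
vm.record(DynamicSample(60.0, 4.0, 1024, 30))
```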


The Server Redundancy Level 


Uptime and service availability are strong 
arguments for virtualization. But this depends strongly 
on the infrastructure’s stability and capability to 
re-provision virtual machines. Several virtual machines 
on the same server means increased pressure on the 
server not to fail. If it shows signs of failure, such 
as messages about a failing disk, the virtual machines 
must be re-provisioned as quickly as possible so that 
the server can be taken down for maintenance. A server 
that crashes with virtual machines still on it is a 
disaster, aptly described as the 
all-the-eggs-in-one-basket effect. The system 
administrator therefore needs to put much effort into 
knowing when a server can be taken down, and which one. 


Live migration is one of the most attractive 
features when it comes to re-provisioning a virtual 
machine. It enables a virtual machine to move to a 
different server without downtime. Other methods of 
migration exist too, like cold migration, where the 
virtual machine is shut down before it is moved. The 
latter approach usually assumes that the two servers do 
not share common storage for the VM’s filesystem, so 
the filesystem has to be transported to the new server. 


In both cases, however, the assumption is that 
there is enough capacity on the receiving servers to 
accommodate more virtual machines. In a multi-server 
environment it becomes difficult to assess the possibil- 
ity of, e.g., freeing one server from all its virtual 
machines. Do we have enough combined capacity to 
remove one physical server? How can we find the 
answer to that question? 


We propose a simple notation for the redundancy 
of server capacity called the redundancy level. It is 
written “R/S”, where R is the number of servers that 
are currently in use and S is the number of those 
servers that can be shut down and removed. The idea is 
to calculate the available capacity on all servers and 
the current usage of all the virtual machines. From 
this, the R/S value can be derived. It is usually 
enough to consider only static resources as capacity 
limits, with most focus on memory and, if the virtual 
machines are stored locally, disk space. 


[Figure 1 (diagram): three servers, each hosting one 
virtual machine. VM1: memory 512 MB, disk 30 GB. 
VM2: memory 512 MB, disk 30 GB. VM3: memory 612 MB, 
disk 30 GB. Memory per server: 1024 MB. Disk per 
server: 100 GB.] 

Figure 1: An example showing a server redundancy 
level of 3/0 because of VM3’s larger memory setting. 

[Figure 2 (diagram): three servers hosting four 
virtual machines. VM1, VM2 and VM3: memory 128 MB, 
disk 25 GB each. VM4: memory 256 MB, disk 30 GB. 
Memory per server: 1024 MB. Disk per server: 100 GB.] 

Figure 2: An example showing a server redundancy 
level of 3/1. With some initial planning, a level of 
3/2 could have been achieved. 


Example 1 


In the following example we have three servers 
and three virtual machines. Each of the three servers 
has 1024 MB of memory available for virtual machines 
as well as 100 GB of storage space. VM1 and VM2 each 
use 512 MB of that memory together with 30 GB of 
storage for their hard disks. VM3 uses 612 MB of 
memory, so server3 has less free capacity than server2 
and server1. Its available memory (412 MB) is not 
enough to accommodate either VM1 or VM2. Server1 and 
server2 each have 512 MB of available capacity and 
cannot accommodate VM3. The result is that if server3 
should be removed, we will lose VM3. The server 
redundancy level is 3/0 because we have three servers 
and server3 cannot be removed. 


Clearly, if server1 goes down, then server2 
would have enough spare capacity to hold VM1, but 
the “S” value in the redundancy metric should be as 
pessimistic as possible and must therefore hold for all 
servers. For only server1 and server2 we have, in fact, 
2/1. 


Example 2 


This example is a little more complicated. The 
servers are the same as before, but now we have more 
light-weight virtual machines. We see that the total 
memory consumption of all four virtual machines is 640 
MB, which is less than the capacity of any one of the 
servers. This means that a single server could 
accommodate all the virtual machines, which would 
result in a redundancy level of 3/2. Unfortunately, 
since VM4 uses 30 GB of disk space, the total disk 
space consumption is 105 GB, and that is more than a 
single server can offer. So in reality the redundancy 
level is only 3/1. 


This example shows how some initial planning 
could improve the system’s overall redundancy level. 
If one server went down, we would end up with two 
servers with a combined memory capacity of 2048 MB, of 
which only 640 MB would actually be used. On top of 
that, the redundancy level would then be 2/0, meaning 
two underutilized servers of which neither can be 
spared. 
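
The two examples can be checked mechanically. The sketch below is one possible reading of the R/S definition and not MLN's implementation: S is taken to be the largest k such that any k servers can be removed and their virtual machines re-provisioned onto the remaining servers, considering memory and disk only. The function names are hypothetical, and the placement used for Example 2 (VM1 and VM2 on server1, VM3 on server2, VM4 on server3) is an assumption chosen to be consistent with Figure 3. The search is brute force and only suitable for small infrastructures.

```python
from itertools import combinations


def can_place(vms, free):
    """Backtracking search: can every (memory, disk) VM fit on some server?"""
    if not vms:
        return True
    (mem, disk), rest = vms[0], vms[1:]
    for server, (free_mem, free_disk) in free.items():
        if mem <= free_mem and disk <= free_disk:
            trial = dict(free)
            trial[server] = (free_mem - mem, free_disk - disk)
            if can_place(rest, trial):
                return True
    return False


def redundancy_level(capacity, placement):
    """Return (R, S) for identical servers with `capacity` = (memory, disk).

    `placement` maps each server name to the list of (memory, disk) VMs on it.
    S must hold for every choice of servers, i.e. it is pessimistic.
    """
    cap_mem, cap_disk = capacity
    servers = list(placement)
    spare = 0
    for k in range(1, len(servers)):
        for removed in combinations(servers, k):
            displaced = [vm for s in removed for vm in placement[s]]
            free = {
                s: (cap_mem - sum(m for m, _ in placement[s]),
                    cap_disk - sum(d for _, d in placement[s]))
                for s in servers if s not in removed
            }
            if not can_place(displaced, free):
                return len(servers), spare
        spare = k
    return len(servers), spare


capacity = (1024, 100)  # MB of memory, GB of disk per server
example1 = {"server1": [(512, 30)], "server2": [(512, 30)], "server3": [(612, 30)]}
example2 = {"server1": [(128, 25), (128, 25)], "server2": [(128, 25)],
            "server3": [(256, 30)]}

level1 = redundancy_level(capacity, example1)
level2 = redundancy_level(capacity, example2)
print("Example 1: %d/%d" % level1)
print("Example 2: %d/%d" % level2)
```

With these inputs the sketch yields 3/0 for Example 1 and 3/1 for Example 2, in agreement with the text.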


The redundancy level mirrors the infrastructure’s 
capability to lose one or more of its servers without 
loss of virtual machines and is therefore valuable 
information to a system administrator. It can be 
considered as a service level policy such as “The 
infrastructure is to keep a 3/1 redundancy level 90% of 
the time.” It can also be useful for capacity planning, 
for calculating how many more virtual machines of a 
certain type can be deployed before the redundancy 
level is altered. It will also show how many resources 
a new server should have in order to keep a certain 
redundancy level for planned additions of more virtual 
machines. The notation should not be confused with a 
division, where 3/1 = 3. It is not meant to be a factor. 
The redundancy level of 4/2 is therefore also not the 
same as 2/1. 


Resource Conflict Matrix 


The previous analysis was from an infrastructure 
point of view. It did not address potential resource 
conflicts between the virtual machines. For this task, 
we propose creating a matrix for every resource type 
and using graph theory to get an indication of the 
number of resource conflicts we have. The analysis is 
simplest for the static variables. A resource conflict 
matrix is a square, symmetric matrix with one row and 
one column for each virtual machine. For each virtual 
machine pair, we put a 1 in their positions if they 
have a resource conflict for that particular resource 
type, otherwise a 0. 

Let’s consider filesystem placement as a static 
resource (i.e., not current disk usage). Two virtual 
machines have a conflict if their filesystems are 
placed on the same hard drive. This would in principle 
imply that two virtual machines placed on the same SAN 
are in conflict even if they run on different servers. 
In other words, these conflicts do not imply that the 
performance is going to be poor. A conflict simply 
states that the two virtual machines have the potential 
to influence each other. 


The resource conflict matrixes are of size V x V, 
where V is the number of virtual machines on the 
infrastructure. One technique to compress the matrix 
into a single value is to treat the matrix as an 
adjacency matrix for a bi-directional graph, and to 
calculate its connectivity using the formula: 

$$\frac{C}{\frac{1}{2} V (V - 1)}$$ 


The connectivity of a graph is the number of links 
divided by the possible number of links. In this case the 
number of conflicts C divided by the possible maximum 
number of conflicts. Its highest value, 1, would imply 
that there is a conflict between all virtual machines for 
that particular variable. A zero, 0, would mean the 
opposite. For each variable one can therefore easily get 
a value describing its current level of conflicts. 
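
As a minimal sketch of this calculation, the code below builds a conflict matrix from a VM-to-server mapping and computes its connectivity. It assumes, for illustration, that two VMs conflict exactly when they are located on the same server (as for local filesystem placement), and it uses the placement of Figure 3: VM1 and VM2 on server1, VM3 on server2, VM4 on server3. The function names are hypothetical.

```python
def conflict_matrix(location):
    """Build a symmetric 0/1 matrix; VMs conflict if they share a server."""
    vms = sorted(location)
    return [[1 if a != b and location[a] == location[b] else 0 for b in vms]
            for a in vms]


def connectivity(matrix):
    """Number of conflicts C divided by the maximum possible, V(V-1)/2."""
    v = len(matrix)
    if v < 2:
        return 0.0
    # Count each pair once by summing the upper triangle only.
    conflicts = sum(matrix[i][j] for i in range(v) for j in range(i + 1, v))
    return conflicts / (v * (v - 1) / 2)


all_up = {"VM1": "server1", "VM2": "server1", "VM3": "server2", "VM4": "server3"}
vm3_to_server3 = dict(all_up, VM3="server3")
vm3_to_server1 = dict(all_up, VM3="server1")

rate_all_up = connectivity(conflict_matrix(all_up))
rate_server3 = connectivity(conflict_matrix(vm3_to_server3))
rate_server1 = connectivity(conflict_matrix(vm3_to_server1))
print(rate_all_up, rate_server3, rate_server1)
```

The three rates correspond to 1/6, 2/6 and 3/6, so moving VM3 to server3 produces fewer conflicts than moving it to server1.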


A simple example of the analysis is given in 
Figure 3. The same matrixes would apply to the static 
CPU resource if two virtual machines on the same 
single-CPU server are considered to be in conflict. 
With few virtual machines, as in our examples, the 
matrixes are obvious. On large sites, however, the 
resulting matrix may become too large for manual 
review. 

        VM1   VM2   VM3   VM4 
VM1      0     1     0     0 
VM2      1     0     0     0 
VM3      0     0     0     0 
VM4      0     0     0     0 

(a) With all servers up. Conflict rate 1/6 = 0.1667. 

        VM1   VM2   VM3   VM4 
VM1      0     1     0     0 
VM2      1     0     0     0 
VM3      0     0     0     1 
VM4      0     0     1     0 

(b) With server2 down and VM3 re-provisioned to 
server3. Conflict rate 2/6 = 0.333. 

        VM1   VM2   VM3   VM4 
VM1      0     1     1     0 
VM2      1     0     1     0 
VM3      1     1     0     0 
VM4      0     0     0     0 

(c) With server2 down and VM3 re-provisioned to 
server1. Conflict rate 3/6 = 0.5. 

Figure 3: A resource conflict matrix for filesystem 
placement based on Example 2. As long as all 
servers are up, we have a low rate of conflicts. If 
server2 goes down (matrixes b and c), the conflicts 
increase depending on how the virtual machines 
are re-provisioned. b) is a better solution than c) 
because it has the lower conflict rate. 


For the dynamic variables, such as CPU or IO 
usage, we need to compare the actual behavior of the 
virtual machines. If a time-series profile of each 
virtual machine existed, the profiles could be compared 
to see if two virtual machines historically tend to 
demand the shared resources at the same time. If so, 
they are in conflict. The level of correlation or 
probability of conflict between the two makes for a 
fuzzier description than 1 or 0. One can still 
calculate the average probability over all pairs of 
virtual machines; however, it does not carry the same 
interpretation as the graph connectivity mentioned 
above, since the result no longer represents the 
average number of conflicts but the average conflict 
probability. Other analysis methods could also be 
applied to the matrix, such as Principal Component 
Analysis or centrality; however, the interpretation of 
these results in this case is still being studied. 
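
One simple way to turn time-series profiles into a conflict probability is sketched below. The text does not prescribe a particular measure, so this is an assumption: the probability for a pair is estimated as the fraction of sampling intervals in which both virtual machines demand at least `threshold` of the shared resource. The threshold value, function names and sample data are all hypothetical.

```python
from itertools import combinations


def conflict_probability(a, b, threshold=0.8):
    """Fraction of intervals in which both profiles are at or above threshold."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    both_busy = sum(1 for x, y in zip(a, b) if x >= threshold and y >= threshold)
    return both_busy / n


def average_conflict_probability(profiles, threshold=0.8):
    """Mean pairwise conflict probability over all VM pairs."""
    pairs = list(combinations(sorted(profiles), 2))
    if not pairs:
        return 0.0
    total = sum(conflict_probability(profiles[p], profiles[q], threshold)
                for p, q in pairs)
    return total / len(pairs)


# Hypothetical CPU utilization samples (0.0 to 1.0), one per interval.
profiles = {
    "VM1": [0.9, 0.9, 0.1, 0.1],
    "VM2": [0.9, 0.1, 0.1, 0.9],
    "VM3": [0.1, 0.1, 0.9, 0.1],
}
p12 = conflict_probability(profiles["VM1"], profiles["VM2"])
average = average_conflict_probability(profiles)
print(p12, average)
```

A correlation coefficient could be used in place of the threshold test; the threshold variant is shown because it yields a value that can be read directly as a probability.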


The resource conflict matrixes are a 
resource-centric description of the infrastructure’s 
state. Connectivity or average conflict levels are ways 
to compress the information into more usable formats. 
Some level of intuition and knowledge is still needed 
to interpret the results. They can highlight an uneven 
distribution of resources where there should be none. A 
matrix also lends itself to graphical representation 
and can contain more information than an average alone. 


We also need a virtual machine-centric perspec- 
tive of the infrastructure in case we investigate a par- 
ticular virtual machine or want to make the best provi- 
sioning decision for a new one. 


Location Conflict Table 


The previous section showed how we can get an 
overall view of the resource conflicts, and that the 
result of re-provisioning a virtual machine depends on 
which server it is placed on. The example covered only 
a single resource. How could we get an impression of 
the conflicts concerning all resources in order to make 
the best decision? Our solution to this problem is to 
list the available servers together with the conflicts, 
for all resources, that would result if the virtual 
machine were placed there. Let us consider the case 
above where server2 goes down and VM3 needs to be 
moved. The resulting location conflict table is shown 
in Table 3. 


Location conflict table for VM3 

Location    Filesystem Placement    CPU    Total 
server1              2               2       4 
server3              1               1       2 


Table 3: An example location conflict table showing 
server3 with fewer conflicts than server1. 

For servers with equal performance, one should 
choose the one with the fewest resource conflicts. In 
many cases, some conflicts are unavoidable. Some 
conflicts may then be given more importance, such as 
those based on time-series profiles. 


The information in this table will potentially 
change for all virtual machines every time a virtual 
machine is re-provisioned. This means that if server2 
had contained two virtual machines, we would have to 
make a decision about the first, and then about the 
second as if the first had already been moved. 
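
The construction of a location conflict table can be sketched as follows. This is an illustration under stated assumptions, not MLN's code: each resource is represented by a hypothetical rule function that decides whether an already-placed VM would conflict with the newcomer on a given server, and for the example both filesystem placement and CPU are assumed to conflict with every VM on the same server (local disks, single-CPU servers).

```python
def location_conflict_table(candidates, location, resource_rules):
    """Count, per candidate server and resource, the conflicts a new VM would get.

    `location` maps the VMs that stay put to their servers.
    `resource_rules` maps a resource name to rule(other_vm, server) -> bool.
    """
    table = {}
    for server in candidates:
        residents = [vm for vm, srv in location.items() if srv == server]
        row = {name: sum(1 for other in residents if rule(other, server))
               for name, rule in resource_rules.items()}
        row["Total"] = sum(row.values())
        table[server] = row
    return table


def shares_server(other_vm, server):
    """Assumed rule: every VM already on the server is a potential conflict."""
    return True


# server2 is down; VM3 must move to server1 or server3.
location = {"VM1": "server1", "VM2": "server1", "VM4": "server3"}
rules = {"Filesystem Placement": shares_server, "CPU": shares_server}

table = location_conflict_table(["server1", "server3"], location, rules)
best = min(table, key=lambda s: table[s]["Total"])
for server, row in table.items():
    print(server, row)
print("preferred location:", best)
```

When several virtual machines must be moved, the table would be recomputed after each decision with the moved VM added to `location`, as described above.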


Implementation Into MLN 


The methods mentioned above are general ways 
to analyze the state of the infrastructure. The benefit 
of quantifiable metrics is that their calculation can 
be automated. The server redundancy level, resource 
conflict matrixes and location conflict table are 
implemented in MLN for static variables, with 
preliminary support for dynamic variables as well. MLN 
has enough knowledge about virtual machine location and 
resources to determine conflicts automatically. 


The benefit of this information is currently being 
studied in a master’s thesis with regard to automated 
re-provisioning and analysis [13]. MLN is only capable 
of providing current state values, so it cannot yet 
compute “what if” results for planning. Work is under 
way to include analysis features where the user could 
propose changes to the provisioning or other virtual 
machine metrics and see the result before they are 
committed. 


Discussion 


The server redundancy level only assumes that 
we will re-provision virtual machines from a given 
server onto the others. It does not assume that we 
could re-arrange all the virtual machines optimally on 
all servers for the highest utilization. We have 
identified some additional challenges in that case 
which are still under consideration. The most important 
point is what strategy to use for the re-provisioning 
process. We have identified a few alternatives: 

• Least Migrations: This would reduce the risk 
of virtual machines dying in the migration 
process. 

• Least Memory Copied: Migration time depends 
on the main memory of the VM to be copied in 
the background. Less memory to copy would 
result in faster migration. 

• Minimize Resource Conflicts: Regardless of the 
number of migrations, just reduce the number 
of conflicts to a minimum. 

• Most Important Last: Some VMs are more 
important than others. Avoid touching the most 
important. 
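
The alternatives above can be viewed as different cost functions applied to the same set of candidate re-provisioning plans. The sketch below illustrates that idea only; the data structures, the numeric importance scale, the strategy identifiers and the sample plans are all hypothetical, and "Most Important Last" is interpreted here as minimizing the importance of the most important VM that a plan touches.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Migration:
    vm: str
    memory_mb: int
    importance: int  # hypothetical scale: higher means more important
    target: str


@dataclass(frozen=True)
class Plan:
    name: str
    migrations: List[Migration]
    conflicts_after: int  # resource conflicts remaining once the plan is applied


def plan_cost(plan, strategy):
    """Lower is better; each strategy scores a plan differently."""
    if strategy == "least_migrations":
        return len(plan.migrations)
    if strategy == "least_memory_copied":
        return sum(m.memory_mb for m in plan.migrations)
    if strategy == "minimize_conflicts":
        return plan.conflicts_after
    if strategy == "most_important_last":
        return max((m.importance for m in plan.migrations), default=0)
    raise ValueError(f"unknown strategy: {strategy}")


def choose_plan(plans, strategy):
    return min(plans, key=lambda p: plan_cost(p, strategy))


# Two hypothetical plans for the same situation.
plan_a = Plan("A", [Migration("VM4", 256, 3, "server1")], conflicts_after=3)
plan_b = Plan("B", [Migration("VM1", 128, 1, "server2"),
                    Migration("VM2", 128, 2, "server3")], conflicts_after=1)

for strategy in ("least_migrations", "least_memory_copied",
                 "minimize_conflicts", "most_important_last"):
    print(strategy, "->", choose_plan([plan_a, plan_b], strategy).name)
```

Note that with equal costs `min` returns the first plan listed, so a real implementation would need an explicit tie-breaking rule.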


Which strategy is best depends on the context. In 
an automated re-provisioning scenario, the user should 
be able to express to the system which strategy is 
preferred. 

A resource conflict does not necessarily mean 
performance degradation, but it may be a good indicator 
of where to look for answers if a virtual machine 
performs poorly. The resource conflict matrixes can be 
used to look up the current conflicts of a given 
virtual machine at any time. We consider this helpful 
support information for system administrators. 


Some of the resources and conflicts are diffuse. 
Consider five single-CPU virtual machines on a four-CPU 
server. Who is in conflict with whom? The strictest 
interpretation is that all five are in conflict with 
each other. 


The connectivity or average conflict rate may not 
carry much information at the first calculation. The 
optimal or lowest possible value is entirely context 
dependent. It is therefore not suitable as a basis for 
comparing two different infrastructures. However, it is 
valuable as a measurement at re-provisioning time. It 
is a way to observe how the addition of a new or 
existing virtual machine influences the rest. 


Conclusions and Future Work 


Virtual machines can easily be deployed using 
MLN and have acceptable performance levels through 
the Xen virtual machine framework. The ideal sce- 
nario would be one virtual machine on each server, but 
that is not always possible. Finding the best placement 
of the virtual machines in order to avoid resource con- 
flicts is an open challenge. 


Three methods were proposed for infrastructure 
analysis based on our own experience. Each method is 
technology independent and can be applied by hand 
for fair-sized infrastructures. They are also imple- 
mented in MLN in order to cope with larger scenarios 
where manual calculation is impractical. 


We recognize that these methods have issues of 
their own, but we see them as steps towards a better 
understanding of how to manage virtual machines 
successfully. This research field is only just emerging 
as more sites start to use virtual machines on a large 
scale and encounter the same kinds of problems. 


Work on MLN will continue in this direction, and 
MLN may function as a way to let the industry test our 
ideas and provide us with valuable feedback. Users 
should also be able to define other resources to be 
included in MLN’s existing analysis. This may be easy 
to accommodate for static variables; the dynamic ones, 
however, would also require that a monitoring framework 
for that variable exists. 


Related Work 


In recent years, many researchers have promoted 
the use of virtualization and virtual machine 
technologies in scientific computing and HPC 
environments. Such arguments appeared initially in the 
literature of grid computing. For example, Figueiredo 
and coworkers [6] proposed using virtual machine 
technology in combination with management middleware to 
simplify the task of seamlessly providing distributed 
computing resources to grid end users. They noted that 
using virtualization provided several advantages over 
grids constructed via real systems. Virtualization also 
offers many advantages to high performance computing; 
for a succinct summary, see Mergen, et al. [5]. 


Several groups have performed performance 
analyses of scientific problems in virtual 
environments. Figueiredo and coworkers [6] compared the 
performance of their virtualization-based approach with 
that of the regular operating system by running a few 
of the benchmarks from the SPEC suite directly on a 
Linux system and in a virtual machine running the 
same operating system under VMWare. Their results 
indicated that only minimal overhead was associated 
with employing virtualization, in the range of 
about 2-4%. They also considered the time required 
for virtual machine setup, either via the full startup 
process (i.e., VM reboot) or by restoring a saved VM. 
Startup times generally ranged from about 30 seconds 
to 1 minute, depending on the specifics of the startup 
method and VM disk file storage location. 


Youseff and coworkers [10] compared the per- 
formance of Red Hat Enterprise Linux running di- 
rectly on the hardware as well as a guest operating 
system under the Xen environment via a series of 
standard benchmark applications. Their tests included 
three categories of calculation: micro-benchmarks each 
focused on a single system resource (CPU, network 
communication, memory and disk I/O), a series of 
matrix-based computations designed to test parallel 
program execution efficiency, and a single scientific 
simulation taken from the HPC Challenge benchmark 
suite (the MIT GCM exp2 which models a planetary 
ocean circulation process). In all cases, including the 
latter, these researchers found no statistically sig- 
nificant performance degradation from employing vir- 
tualization. 


Bjerke [14] implemented Xen virtualization for 
HPC on Itanium systems. He presents some per- 
formance data for typical software building processes, 
again finding little difference between the native sys- 
tem and Xen virtual machine instances. 


Finally, Vogels has explored virtualization for 
HPC in the Windows environment/.NET framework 
[15]. He ran benchmarks from the SciMark suite, 
focusing on comparing different virtual environments. 
In general, however, his results indicate that vir- 
tualization is a viable alternative for Java-based high 
performance computing in this environment as well. 


Previous work on checkpointing has focused on 
general solutions for applications on UNIX systems. 
For example, Plank and coworkers [7] created the 
libckpt library with the goal of rollback recovery for 
an executing program on UNIX systems. The library 
worked by periodically saving the application’s 
current state to a disk file. In the event of a failure 
(i.e., program crash), the program could be restarted 
from the most recent save point (checkpoint). The 
facility was capable of operating in a fully automated 
mode, without requiring any program modifications, for 
existing applications; programmers could also add 
checkpointing-related directives to the code if the 
source code was available. The former is most directly 
comparable to virtualization. The researchers tested 
their approach using several computationally-intensive 
benchmarks. For these applications, the overhead 
associated with using libckpt in fully automated mode 
was considerable, ranging from about 5% to about 15%, 
with the larger values associated with the larger and 
most realistic benchmarks. 
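
The general idea of periodic checkpointing with rollback recovery can be illustrated with a toy example. This sketch is not libckpt and does not use its API; it works at application level, uses Python's standard `pickle` module to save state, and the function name, checkpoint interval and simulated crash are all hypothetical.

```python
import os
import pickle
import tempfile


def run(total_steps, ckpt_path, interval=3, crash_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps."""
    if os.path.exists(ckpt_path):
        # Rollback recovery: resume from the most recent checkpoint.
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            # Write to a temporary file first so that a crash during the
            # write cannot corrupt the previous checkpoint.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, ckpt_path)
    return state["acc"]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "job.ckpt")
    try:
        run(10, path, crash_at=5)   # first attempt fails part-way through
    except RuntimeError:
        pass
    result = run(10, path)          # second attempt resumes from step 3
print(result)
```

The work done between the last checkpoint and the crash is repeated on restart, which is the trade-off that the checkpoint interval controls.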


Wang and coworkers [8] performed similar work 
at about the same time. Their libckp library for UNIX 
systems similarly periodically saved the execution 
state from user applications in a transparent manner. 
Their work also included a generalization feature 
which allowed a checkpoint to be reused with different 
input/data. 


VM management has also received substantial 
research attention. The closest work to MLN is 
probably the In-VIGO system of Adabala and coworkers 
[16]. It is an example of using virtualization for grid 
computing, specifically by constructing virtual grids 
on top of real systems via a software middleware 
layer. In this way, In-VIGO provides a grid computing 
environment in which the actual physical systems and 
resources are transparent to grid users. It is designed 
both to simplify using the grid computing resources 
for end users and to simplify the grid management 
tasks for system administrators. 


However, unlike MLN, it is quite a complex 
system (albeit a powerful one) with a significant 
learning curve. In addition, many of its features are 
simply not needed for production scientific computing. 
The follow-on work, the VMPlants facility [17], 
is similar to MLN. It provides virtual machine 
creation, shutdown and cloning. Its operation is 
client-driven, in keeping with In-VIGO’s goal of 
simplifying the end user experience, in contrast to 
MLN, which can employ either a push or pull 
approach. 


Finally, Clark and coworkers [9] have studied 
migrating running virtual machines across physical 
hosts without the need for even temporary suspension. 
They found migrating live memory to be challenging, 
especially for applications with intensive memory 
write rates. They offer a solution for VM live 
migration within a group of discrete systems 
or a cluster with high performance network 
interconnects and network-based disk storage. 


Computation Details 


All Gaussian 03 [11] jobs were limited to 256 
MB of memory and used the 32-bit version of the 


102 21st Large Installation System Administration Conference (LISA ’07) 


Begnum, Disney, Frisch, & Mevag 


program. The VMs used were allocated 512 MB of 
memory. The key components of the computer config- 
urations were as follows: 1.8 GHz Xeon with 1 GB 
main memory (master node for parallel jobs); 2.8 GHz 
Xeon with 2 GB main memory; 2 GHz AMD-64 with 
1 GB main memory. The network interconnect was 
100BaseT. Parallel jobs made use of the Linda parallel 
execution environment [12], as required by the appli- 
cation. Job details: (1) HF/3-21G Opt Freq on Buck- 
minsterfullerene, 540 basis functions; (2) B3LYP/ 
3-21G Force SCF=NoVarAcc on Valinomycin, 882 
basis functions; (3) MP2/6-311+G(2d,2p) Opt Freq on 
Malonaldehyde, 171 basis functions. All jobs were run 
on otherwise idle systems. 


The following table gives the raw results for 
these jobs. The column headings have the following 
meanings: RHEL=job run on system running standard 
kernel; Xen=job run on system running Xen-enabled 
kernel, but not in a VM; VM=job run in a Xen VM 
(RHEL guest OS). The single processor jobs were all 
run on the slowest node, which also served as the mas- 
ter node for the parallel jobs. 


Elapsed Time (seconds) 

         Single Processor                      Parallel Execution 
Job      RHEL    Xen            VM             Xen*3    VM*3 
1        346     359            348            169      168 
2        727     [illegible]    [illegible]    268      265 
3        420     433            453            366      338 

Table 4: Raw results as elapsed time for each test 
scenario. 

ABSTRACT 


OS Circular is a framework for Internet Disk Image Distribution of software for virtual 
machines, those which offer a “virtualized” common PC environment on any PC. OS images are 
obtained via the stackable virtual disk “Trusted HTTP-FUSE CLOOP”. 


The system is designed to utilize Mirror servers and Proxies for highly-scalable worldwide 
deployment. OS Circular easily and efficiently handles both partial and periodic OS updates, 
including a rollback facility to ease experimentation with new OS images that might not be ready 
for production. This paper describes the design of OS Circular and the techniques to reduce the 
network traffic for quick downloading and booting. 


Introduction 


We have developed “OS Circular” [1, 2], a tech- 
nique for booting any OS on an anonymous PC over 
the internet. It enables selection of OS images and can 
facilitate a reference installation. This means that an 
OS image can be “installed” to ease testing and feasi- 
bility testing of new functionality before it is installed 
on the local hard drive for production use. The previ- 
ous (reference) version of an OS image stores old ap- 
plication software and enables opening of files in any 
previously supported format. In a world of frequent 
security updates, this environment dramatically en- 
hances the ease of testing OS service packs. 


Virtual machines host a common PC environment 
on a standard PC. Software to host virtual machines is 
easy to obtain and use, with open source packages 
QEMU, KQEMU, KVM, and Xen in addition to free 
versions of VMware and VirtualBox. Recent virtual 
machines have quite low overhead and enable the use 
of guest OSes without high resource utilization. 
Furthermore, newer x86 architecture CPUs have a 
virtualization extension (Intel’s VT or AMD’s SVM) that 
promotes even better virtual machine software 
(including a mode for trapping sensitive instructions). 
Both KVM and Xen-HVM use this virtualization extension 
and offer full virtualization. 


One current challenge for virtual machines is the 
scheme used to share virtual disks on the Internet [3, 
4]. These virtual disks represent both OSes and appli- 
cation software and should be updated periodically 
(e.g., for security). The previous disk image should be 
archived efficiently and might be used for replaying 
the old OS image when an update is unsatisfactory. 
New master disk images could be shared with a poten- 
tially massive number of users, and efficiently doing 
that is one of the primary goals of OS Circular. 


OS Circular is designed to utilize existing 
infrastructures in order to enable global deployment 
and minimize maintenance cost. As a client-centric 
system, OS Circular reduces server load by limiting 
server functionality to file distribution. OS Circular 
uses HTTP for file distribution because of the ease of 
accessing Web hosting services (including exploitation 
of mirror servers and cache proxies). 


The client PC checks data for validity during 
downloads (which are provided by Trusted HTTP- 
FUSE CLOOP, a stackable virtual disk). Trusted 
HTTP-FUSE CLOOP includes support for update of 
virtual disks. OS Circular uses full virtualization (to 
avoid exploits due to the kernel of the guest OS being 
insecure). It also includes an automatic security update 
service. 
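
Client-side validity checking of downloaded data can be sketched as follows. This section does not say how Trusted HTTP-FUSE CLOOP performs the check, so the approach shown (comparing a cryptographic digest of each downloaded block against an expected value obtained from a trusted source) is an assumption for illustration. The choice of SHA-256, the function names and the `fetch` callback are hypothetical; no real network access is performed.

```python
import hashlib


def verify_block(data: bytes, expected_digest: str) -> bool:
    """Return True if the block's SHA-256 digest matches the expected value."""
    return hashlib.sha256(data).hexdigest() == expected_digest


def fetch_verified(fetch, name, expected_digest, retries=3):
    """Download a block via `fetch(name)`, retrying if validation fails.

    `fetch` stands in for an HTTP download from a mirror or proxy, which the
    client does not need to trust because every block is checked locally.
    """
    for _ in range(retries):
        data = fetch(name)
        if verify_block(data, expected_digest):
            return data
    raise IOError(f"block {name!r} failed validation after {retries} attempts")


# Simulated download: the first mirror returns corrupted data, the second is good.
good_block = b"block-0 contents"
digest = hashlib.sha256(good_block).hexdigest()
responses = iter([b"corrupted", good_block])

data = fetch_verified(lambda name: next(responses), "block-0", digest)
print(data == good_block)
```

Because validation happens on the client, untrusted mirrors and cache proxies can be used for distribution without weakening the integrity of the disk image.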


This paper details OS Circular’s design and im- 
plementation paradigm. We mention related work, 
then describe the virtual machine as an abstraction 
layer. The next section describes the requirements for
virtual disks. Subsequently, we discuss details of the
Trusted HTTP-FUSE CLOOP and the current imple- 
mentation and performance of OS Circular. Finally, 
we discuss future plans and conclude. 


Related Work 


Virtual disks are a popular means for distribution of 
ready-to-use OS images for virtual machines. The OS 
Zoo project [5] distributes many virtual disk files for 
QEMU (an open source machine emulator and virtualiz- 
er). This eases experimentation with various operating 
systems since installation is relatively simple. However, 
virtual disks often must be treated in some ways as a sin- 
gle — potentially extremely large — file. Downloading 
hundreds of megabytes can take quite a while. Further- 
more, even the update of a single bit requires the (time- 
consuming) reconstruction of the entire virtual disk, a re- 
al detriment when an urgent security update is required. 


FLOZ (Free Live OS Zoo) [6] is an interesting
derivative of the OS Zoo project. It enables booting
of many OSes via a Web browser. FLOZ runs a QEMU
virtual machine on the server and transfers the visual
console of QEMU to a Web browser on a client. While
very innovative for OS testing, its performance and
scalability are limited by its server-centric design,
which also suffers greatly as Internet network latency
(for file access) increases. Furthermore, OSes hosted
on FLOZ do not allow Internet connections due to server
network resource and security concerns.


“Collective” [3] and “Ventana” [4] also propose
client-centric systems. They run virtual machines on a
client and download “diffs” of updated OS images.

• Collective uses VMware’s COW (Copy On Write) feature
  for partial update. This is a disk block level mapping
  and thus handles any filesystem format. Updated data
  is saved to a file, making it easy to deal with.
  Unfortunately, it is difficult to map many COW images.
  The developers established a company called Moka5 to
  offer LivePC [7], an improved version of Collective.
• Ventana is a virtualization-aware filesystem with
  versioning, access control, and disconnected
  operation. It is a management system for virtual
  disks and offers a customized view of a disk image
  for each user. Unfortunately, it is based on NFS and
  is not designed for a massive number of anonymous
  users on the Internet.


Virtual Machines 


Full virtualization offers a sort of abstraction lay- 
er for OSes that provides a common virtualized PC en- 
vironment on an anonymous PC. We can install a 
guest OS with its normal installer, update the OS with 
its usual package management software, and migrate it 
to other PCs via virtual disks. Various virtualizers do 
this in different ways: VMware is used on SoulPad 
[8], VAT (Virtual Appliance Transceiver) of Collective 
[3], and Internet Suspend/Resume [9]. 


Device Model 


It is critical that full virtualization offer the same 
device model that a guest OS expects. When this vir- 
tualization is present, the guest OS need only prepare 
drivers for abstracted devices — not for any specific 
model or type of disk. 


QEMU-DM (Device Model) is becoming popular in the
open source community. It assumes a RealTek RTL8029 for
the NIC, a Cirrus Logic GD5446 for the video card, and
a few other devices. Both Xen and KVM also support
QEMU-DM, so a guest OS only has to support these
devices.


The differences among IA32 architectures cause
no problems for the emulators because recent OSes offer
common i386 packages for IA32 which run on Pentium Pro
or later IA32 CPUs. QEMU, KQEMU and KVM offer the
Pentium II architecture for guest OSes but Xen-HVM
offers the same architecture as the real CPU upon which
it is running. These differences are mitigated by the
“universal” packages of OSes.


VM Loader 


OS Circular requires both a virtual machine and 
a stackable virtual disk. For convenience, we have de- 
veloped and offer a 1CD Linux installation which in- 
cludes both of these. 


This host OS supports the drivers for real
devices. SoulPad and VAT of Collective use KNOPPIX as
the host OS, because KNOPPIX has an “AutoConfig”
function that automatically detects available devices
at boot time and loads the appropriate drivers. The
combination of AutoConfig and full virtualization of
the host OS acts as a virtual machine loader.


We offer VMKNOPPIX, a collection of virtual 
machine software on KNOPPIX, as a VM Loader. It 
includes QEMU, KQEMU, KVM, and Xen-HVM, 
each of which offers full virtualization. Xen-HVM and 
KVM require a CPU with the virtualization extension 
(Intel VT or AMD-SVM) but QEMU and KQEMU 
run on any IA32 architecture. KNOPPIX acts as the 
host OS and has drivers for almost any PC. 


Requirements for Virtual Disks 


Virtual disks should offer features such as
versioning, globalization, and security; see [4] for
the main requirements. These requirements motivate the
design of Trusted HTTP-FUSE CLOOP.


Versioning 


Versioning of virtual disks is very desirable:

• non-persistent versioning is used for “undo” of
  operations
• while persistent versioning is used for “rollback”
  of the OS image

Versioning is also important for sharing and
customization of virtual disks.


Some virtual machine software supports the “undo”
function, including the “non-persistent” mode of
VMware and “CopyOnWrite” of QEMU and UserModeLinux.
Xen uses DeviceMapper of Linux for non-persistent
versioning. Virtual disks should also offer persistent
versioning.


Trusted HTTP-FUSE CLOOP has the same ver- 
sioning scheme as Venti [10], Plan9’s archival storage 
system that permanently stores data blocks (blocks 
which comprise the data of a filesystem — the filesys- 
tem’s structure is independent of the block storage). 
Each block is stored in a file whose name is its SHA1 
hash. The system enforces a write-once policy since 
no other data block should ever have the same hash. 
Duplicate data is easily identified (since it has the 
same hash) and a given data block is stored only once. 
Data blocks cannot be removed, making the system 
ideal for permanent or backup storage. Unfortunately, 
Venti requires a special protocol for accessing the 
filesystem, and this limits its scalability.


Trusted HTTP-FUSE CLOOP saves data blocks in much
the same way as Plan9 and utilizes HTTP to distribute
them.


Globalization 


One goal of OS Circular is the provision of virtual
disks that can be shared via the Internet. This does
not mean downloading a complete virtual disk file but
rather accessing any part of a disk as requested.


The “Remote Block Device” scheme meets this
requirement and is found in many implementations,
including iSCSI, AOE (ATA over Ethernet), iFCP, and
more. Unfortunately, all of these use special protocols
and require special daemons on the server. Most of them
are designed for use on a high speed LAN and aren’t as
useful on the slower Internet. Trusted HTTP-FUSE CLOOP
reconstructs a virtual disk from partial block files
which are downloadable via HTTP.


Virtual disks also require “disconnected operation”
to achieve mobile computing. The “AFS” and “Coda”
filesystems deal with disconnected operation but also
require special protocols and daemons on servers.
Stateless Linux [11] offers disconnected operation for
a thin client PC. Stateless Linux runs from network
storage or from a snapshot image saved in local
storage. Block files of Trusted HTTP-FUSE CLOOP are
also saved to local storage and are reusable (when all
block files are saved in local storage, a network
connection is not required).

Security

Basically, security management is independent of
the virtual disk implementation. Security of the kernel
and applications should be managed by security software
or a package manager. A virtual disk commits only to
preserving the integrity of its contents. Probably the
biggest part of security for a network virtual disk is
the prevention of intrusion. One way to prevent
intrusion is to use secure communication, but this
requires a fixed server, which limits scalability.
Trusted HTTP-FUSE CLOOP instead adopts a client-centric
content validation mechanism: a driver checks the
validity of block contents when they are mapped to the
virtual disk. This allows block data to be distributed
by anonymous servers.


Trusted HTTP-FUSE CLOOP 


KNOPPIX’s CLOOP (Compressed Loopback device)
inspired the overall idea of the HTTP-FUSE CLOOP
filesystem. KNOPPIX saves compressed block device data
to a file in order to reduce disk consumption (though a
CLOOP file is still big). Traditional CD-KNOPPIX
requires about 700 MB and must be treated as one file,
which slows downloads. When even just a single bit is
updated, the big CLOOP file must be rebuilt.
Furthermore, CLOOP has no security protection.


To mitigate these problems, we adopted the 
block management style of Venti [10]. Data on a block 
device is partitioned into data blocks of a fixed size, 
compressed, and saved. Block files are treated as net- 
work transparent between local and remote machines, 
with local storage acting as a cache. The downloaded 
block files are validated by their SHA1 hash value. 
This yields these features: 

• A block file is made of split, compressed blocks of
  the block device (the default size is 256 KB; the
  original CLOOP’s split block size was 64 KB, which was
  too small and created too many files).
• Block files are mapped to a loopback device with a
  mapping table file.
• The mapping of a block file is performed when a
  relevant read request is issued. After mapping, the
  block file may be erased from local (cache) storage,
  since it can be downloaded anew any time it is needed.
• The name of a given block file is the value of its
  SHA1 hash, with all the good properties listed above.
• Block files are downloaded from an HTTP server because
  HTTP is expected to be a strong file delivery
  infrastructure, as demonstrated by mirror servers and
  proxy servers.
• A proxy server limits the size of the files it caches,
  so block files should be smaller than this limit.
• When a block file is mapped to the loopback device,
  its contents are hashed with SHA1 and checked against
  the file name listed in the mapping table file.
• The block device is partially updatable: when an
  application is updated on the original block device,
  the relevant block files and the mapping file are
  renewed; cached block files for the non-updated
  blocks are reusable.
• “FUSE” (File system in USEr-space) [12] is used to
  implement the virtual loopback device.

Figure 1: Creation of block files from OS image.
[Diagram labels: mapping table with columns “Address”
and “File Name”; “Block file is named by SHA1 digest of
its contents”; “The block files are reconstructed as a
virtual disk with HTTP-FUSE CLOOP”.]

Figure 3: Block mapping of Trusted HTTP-FUSE CLOOP.


Figure 1 depicts the creation of block files and
the mapping table file “map01.idx”, which is also made
from a block device that includes the root filesystem.
Each block file is named by its SHA1 hash.
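
The creation of block files depicted in Figure 1 can be
sketched in a few lines of Python. This is an illustration
rather than the authors' implementation: the function name
make_block_files is hypothetical, zlib stands in for the
compression used by CLOOP, and the sketch assumes that a
block file is named by the SHA1 digest of its compressed
contents (the text does not say whether the digest covers
the compressed or the decompressed data).

```python
import hashlib
import zlib

BLOCK_SIZE = 256 * 1024  # default block size stated in the text


def make_block_files(image: bytes, block_size: int = BLOCK_SIZE):
    """Split a disk image into block files and build a mapping table.

    Returns (mapping, store): mapping[i] is the SHA1 name of block i,
    and store maps each SHA1 name to the compressed block file contents.
    Assumption: the name is the SHA1 of the compressed file contents.
    """
    mapping = []
    store = {}
    for offset in range(0, len(image), block_size):
        block = image[offset:offset + block_size]
        compressed = zlib.compress(block)
        name = hashlib.sha1(compressed).hexdigest()
        # Identical blocks hash to the same name and are stored once.
        store.setdefault(name, compressed)
        mapping.append(name)
    return mapping, store
```

Because identical blocks produce the same name, they are
stored only once, which is the Venti-style property the
design relies on.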


Figure 2 shows the block diagram of Trusted
HTTP-FUSE CLOOP. The loopback device is mapped as a
normal block device on a virtual machine. The main
program is implemented as part of a FUSE wrapper
program. The mapping table file maintains the validity
of the virtual blocks and must be distributed in a
secure way. The mapping table file is used to set up
Trusted HTTP-FUSE CLOOP. When a read request is issued,
the Trusted HTTP-FUSE CLOOP driver looks up the
relevant block file using the mapping table. When the
relevant file exists in local storage, that file is
used; otherwise, the file is downloaded from the
Internet.


Block files are downloaded from an HTTP server with
“libcurl”, the Client for URLs library. Each downloaded
file is decompressed by “libz”, hashed by “libcrypto”,
and logged to /var/log/fs_wrapper_PID.log. Invalid file
transfers (due to network errors or security breaches)
are detected using the SHA1 hash. Figure 4 shows an
example that detects a defective block file. The
downloaded block file is first stored in local storage.
If the storage space is insufficient (more than 80%
used), previously downloaded files are removed by LIFO
(a water mark algorithm). Figure 3 shows Trusted
HTTP-FUSE CLOOP from the viewpoint of block mapping.
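
The read path described above (look up the mapping table,
use the local copy if present, otherwise download, then
validate) might look roughly as follows. This is a sketch
under stated assumptions, not the actual FUSE wrapper: the
names read_block and BlockValidationError are hypothetical,
a dict stands in for local storage, a caller-supplied fetch
function stands in for the libcurl download, and the SHA1 is
assumed to cover the compressed file contents. The log lines
only imitate the “missed” and “hits” tags of Figure 4.

```python
import hashlib
import zlib


class BlockValidationError(Exception):
    """Raised when a block file does not match its SHA1 name."""


def read_block(index, mapping, cache, fetch, log=None):
    """Return the decompressed contents of virtual block `index`.

    mapping: list of SHA1 names, one per block (the mapping table).
    cache:   dict name -> compressed bytes (stands in for local storage).
    fetch:   callable(name) -> compressed bytes (stands in for HTTP GET).
    """
    name = mapping[index]
    if name in cache:
        data, tag = cache[name], "hits"
    else:
        data, tag = fetch(name), "missed"
    if log is not None:
        log.append(f"#{index:08d}({name}):{tag}.")
    # Assumption: the name is the SHA1 of the compressed file contents.
    if hashlib.sha1(data).hexdigest() != name:
        cache.pop(name, None)  # never keep an invalid block file
        raise BlockValidationError("can't validate block")
    cache[name] = data
    return zlib.decompress(data)
```

Since validation depends only on the mapping table, the
block files themselves can come from any untrusted mirror or
proxy.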

Partial Update By Adding Block Files

The mapping table handles block addressing. Any
update of HTTP-FUSE CLOOP is performed by adding
updates both to the block files and to the mapping
table. Unchanged (local) block files remain reusable
for caching. To achieve this, the filesystem on
HTTP-FUSE CLOOP must handle updates in block units, as
EXT2 does. The “iso9660” filesystem turns out to be
unsuitable for HTTP-FUSE CLOOP because a partial update
of an iso9660 file changes the location of subsequent
blocks. As usual, updated blocks are saved to a file
named by the SHA1 hash of the block. Collision of file
names is extremely rare (this being the goal of the
SHA1 hash). In the unusual event of a collision, we can
check and repair the problem before uploading the block
files.


Figure 5 shows an example of an HTTP-FUSE CLOOP
update which creates a new mapping table file
“map02.idx” and associated block files. This is
particularly useful when updating KNOPPIX applications,
especially in the case of security updates.
Furthermore, we can roll back to an old filesystem by
reinstating the previous mapping table “map01.idx” and
block files.
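
To make the update step concrete, the following sketch
compares two mapping tables and reports which block files
have to be added and which ones a client can reuse from its
cache. The function name update_plan is hypothetical, and
each mapping table is modeled simply as a list of SHA1 block
file names indexed by block address, which is an assumption
about the table layout.

```python
def update_plan(old_mapping, new_mapping):
    """Compare two mapping tables (lists of SHA1 block file names).

    Returns (to_upload, reusable): block files that appear only in
    the new table, and block files shared with the old table that a
    client may already hold in its local cache.
    """
    old_names = set(old_mapping)
    new_names = set(new_mapping)
    to_upload = sorted(new_names - old_names)
    reusable = sorted(new_names & old_names)
    return to_upload, reusable
```

Because old block files are never overwritten, rolling back
amounts to selecting the earlier mapping table again.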


Optimization for Download 


Trusted HTTP-FUSE CLOOP is sensitive to net- 
work latency because small block files are download- 
ed in the order the blocks are requested. 


Our original implementation suffered performance 
problems at boot time until we implemented two new 
functions: “netselect” and “DLAHEAD” (download 
ahead). 


The “netselect” functionality searches for the
lowest latency download site among candidates (through
judicious use of “ping”). We arranged the HTTP sites to
be dispersed across the global Internet, a spread that
also fosters load-balancing of HTTP services.
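
The selection step can be illustrated with a small sketch.
It is not the netselect tool itself: the names select_mirror
and tcp_connect_latency are hypothetical, and because ICMP
ping usually needs elevated privileges, the sketch measures
TCP connection setup time instead, which is a substitution
made here for convenience.

```python
import socket
import statistics
import time


def tcp_connect_latency(host, port=80, timeout=2.0):
    """Time one TCP connection setup to host:port, in seconds."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.monotonic() - start


def select_mirror(candidates, measure=tcp_connect_latency, probes=3):
    """Return the candidate with the lowest median latency, or None."""
    best_host, best_latency = None, float("inf")
    for host in candidates:
        samples = []
        for _ in range(probes):
            try:
                samples.append(measure(host))
            except OSError:
                continue  # failed probe: ignore this sample
        if not samples:
            continue  # host unreachable on every probe
        latency = statistics.median(samples)
        if latency < best_latency:
            best_host, best_latency = host, latency
    return best_host
```

Taking the median of a few probes keeps a single slow or
lost probe from deciding the choice.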


1150452051.109: #00000000(845b31ded38e15c1fa8febf97fe0781f23af98c3):missed.
1150452051.112: #00000000(845b31ded38e15c1fa8febf97fe0781f23af98c3):hits.
1150452051.112: #00000001(166cbaedbb1cc836e7c95d7d9943efde5a53829e):missed.
1150452051.113: #00000002(29c4e363dbad648072751ca1f856e5780dd2981d):missed.
1150452051.114: #00000003(fa8ad05b713a9cf8a701636cabc353dc58fd6bfd):missed.
1150452051.114: #00000004(1f82a543fa9310c44eff6a13618beca3cacffc12):missed.
1150452051.128: #00000004(1f82a543fa9310c44eff6a13618beca3cacffc12):hits.
1150452051.128: #00000005(916f62a6e2caedc1279a0a74975a406ddb60ec25):missed.
1150452051.129: #00000006(19111dfc877a4fe241e125d10176d85a99b4bb86):missed.
1150452051.130: #00000007(950c1d7623b374f8e03309a93041f5adfa3af80f):missed.
1150452051.130: #00000008(486472b0ee27157d755bd59d623179cfc0034747):missed.

Figure 4a: Correct downloading of block files.

1150452375.989: #00000000(845b31ded38e15c1fa8febf97fe0781f23af98c3):missed.
1150452375.993: #00000000(845b31ded38e15c1fa8febf97fe0781f23af98c3):hits.
1150452375.993: #00000001(166cbaedbb1cc836e7c95d7d9943efde5a53829e):missed.
1150452375.994: #00000002(29c4e363dbad648072751ca1f856e5780dd2981d):missed.
1150452375.995: #00000003(fa8ad05b713a9cf8a701636cabc353dc58fd6bfd):missed.
1150452375.996: #00000004(1f82a543fa9310c44eff6a13618beca3cacffc12):missed.
1150452375.997: #00000004(1f82a543fa9310c44eff6a13618beca3cacffc12):hits.
1150452375.997: #00000005(916f62a6e2caedc1279a0a74975a406ddb60ec25):missed.
1150452375.998: #00000006(19111dfc877a4fe241e125d10176d85a99b4bb86):missed.
E: can’t validate block.

Figure 4b: Invalid block file is detected.

Figure 4: Log of Trusted HTTP-FUSE CLOOP (/var/log/fs_wrapper_PID.log). The “missed” tag indicates a block
requires downloading; “hits” indicates the block file is in local storage.

Figure 5: Update of HTTP-FUSE CLOOP.

Figure 6: Ext2 filesystem is translated to block files of HTTP-FUSE CLOOP and the occupancy of accessed 4 KB
data blocks at boot time.

Figure 7: Ext2optimizer repacks the data blocks to be formed in line.


DLAHEAD downloads the block files necessary for
boot, potentially before they are requested, from a
list of required block files contained in the boot
profile. While this profile depends on the particular
client PC, the difference among varying boot sequences
is small, with the greatest variation caused by the
device drivers themselves. DLAHEAD establishes multiple
(default: four) connections in order to download block
files in parallel and thus reduces apparent network
latency. Downloaded block files are saved to local
storage, which works as a cache.
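
The download-ahead idea can be sketched with a small thread
pool. This is an illustration only: the function name
download_ahead is hypothetical, a dict stands in for local
storage, and the caller-supplied fetch function stands in
for an HTTP download. Validation of the downloaded files is
left out to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor


def download_ahead(profile, cache, fetch, connections=4):
    """Prefetch the block files named in a boot profile.

    profile: iterable of block file names in expected access order.
    cache:   dict name -> bytes (stands in for local storage).
    fetch:   callable(name) -> bytes (stands in for an HTTP GET).
    Returns the names that were actually downloaded.
    """
    wanted = []
    seen = set()
    for name in profile:
        # Skip files already cached and duplicates within the profile.
        if name not in cache and name not in seen:
            seen.add(name)
            wanted.append(name)
    with ThreadPoolExecutor(max_workers=connections) as pool:
        # pool.map preserves input order, so results pair with `wanted`.
        for name, data in zip(wanted, pool.map(fetch, wanted)):
            cache[name] = data
    return wanted
```

With four workers, up to four requests are in flight at once,
so the per-request latency overlaps instead of adding up.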

File System Optimization

Any block size mismatch between the filesystem and
the virtual block device causes fragmentation. For
example, EXT2 filesystems generally assume a 4 KB block
size while HTTP-FUSE CLOOP assumes blocks of 256 KB.
Even if a single block (4 KB) of a local file is
requested, HTTP-FUSE CLOOP downloads the relevant
(compressed) 256 KB block file and decompresses it. If
the next read request isn’t covered by an already
downloaded block file, HTTP-FUSE CLOOP must download
and decompress another block file.


A typical boot sequence and subsequent system
operation use far less data than the downloaded block
files contain. This level of “occupancy” depends on the
filesystem type, the size of the entire filesystem, the
block size, and the stored data. Reference [13] reports
that the occupancy of block files was 30% for running
KNOPPIX 3.8.2 (when the filesystem was EXT2, the volume
was 2 GB, and the block size was 256 KB).
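
As a worked illustration of occupancy, the sketch below
computes the fraction of downloaded 4 KB blocks that were
actually accessed. The definition used here (accessed 4 KB
blocks divided by the 4 KB blocks contained in all touched
256 KB block files) is an assumption about how the figure in
[13] is derived, and the function name is hypothetical.

```python
FS_BLOCK = 4 * 1024        # EXT2 block size from the text
BLOCK_FILE = 256 * 1024    # HTTP-FUSE CLOOP block file size
PER_FILE = BLOCK_FILE // FS_BLOCK  # 64 filesystem blocks per file


def occupancy(accessed_fs_blocks):
    """Fraction of downloaded 4 KB blocks that were actually accessed."""
    accessed = set(accessed_fs_blocks)
    if not accessed:
        return 0.0
    # Every touched block file is downloaded whole.
    block_files = {b // PER_FILE for b in accessed}
    return len(accessed) / (len(block_files) * PER_FILE)
```

Four accessed blocks that fall into four different block
files give an occupancy of 4/256, while 64 consecutive
blocks aligned to one block file give 1.0.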


We applied “ext2optimizer” [13] to the filesystem
for HTTP-FUSE CLOOP to create a profile of the physical
blocks actually accessed at boot time. We then repacked
those blocks into physical blocks located at the
“front” of the file store. The ext2optimizer keeps the
structure of the ext2 filesystem even when the physical
blocks are rearranged, so the filesystem works the same
as before.


Figures 6 and 7 show the effect of ext2optimizer.
Figure 6 is the image in which a normal ext2 filesystem
is translated to block files of HTTP-FUSE CLOOP, with
accessed data blocks scattered throughout various block
files. This requires download of extra data (most 4 KB
blocks are not used in the boot process) and thus
increases boot time. Figure 7 shows the result of
applying ext2optimizer. The accessed data blocks are
repacked to increase the efficiency of downloads and
reduce boot time.


From the filesystem point of view, ext2optimizer 
causes fragmentation but reduces download of block 
files and makes for a quick boot. 
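
The effect of repacking can be sketched as a permutation of
block numbers that places the profiled blocks first. The
functions repack_order and block_files_touched are
hypothetical, and the sketch only computes the new block
positions; it leaves out the rewriting of filesystem
metadata that a real tool needs in order to keep the ext2
structure intact.

```python
def repack_order(total_blocks, boot_profile):
    """Return {old_block: new_block}, placing profiled blocks first."""
    front = list(dict.fromkeys(boot_profile))  # dedupe, keep access order
    front_set = set(front)
    rest = [b for b in range(total_blocks) if b not in front_set]
    return {old: new for new, old in enumerate(front + rest)}


def block_files_touched(fs_blocks, per_file=64):
    """Number of block files that contain the given 4 KB blocks."""
    return len({b // per_file for b in fs_blocks})
```

For a profile whose blocks sit in four different block
files, the repacked layout needs only one block file as long
as the profile holds at most 64 distinct blocks.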


Implementation of OS Circular 


The guest OS images are distributed by Trusted
HTTP-FUSE CLOOP. VMKNOPPIX boots the host OS on a
client PC and runs a virtual machine. In the current
implementation, Debian GNU/Linux and FreeBSD are
bootable on QEMU, KQEMU, KVM and Xen-HVM.


Security Updates


Debian GNU/Linux is supported by a strong
community and offers a network update scheme [14]
called APT (Advanced Package Tool) which uses a list of
repositories and available packages. If a newer package
appears in the repository’s list, APT downloads the
package and hands the process over to “dpkg” (a
medium-level package manager for Debian). The package
manager checks the integrity of the package, updates
the software, and manages security of the total
contents.


Figure 8 illustrates an update on OS Circular.
When the master OS image on a virtual machine is
updated by the “apt-get” command, both a new mapping
table file and new block files are created. These files
are uploaded onto HTTP servers. The client PC
subsequently sees a list of mapping table files and
selects one. Older mapping table files represent less
updated OS images. When we want to use an older OS
image, the old mapping table file is selected.


We monitored upgrades of Debian packages from
07/Dec/2006 to 11/Dec/2006. A total of 854 packages on
a 4 GB virtual disk were translated to 14523 block
files (1.9 GB) on 07/Dec/2006. On 11/Dec/2006, 104
packages were updated and 3420 block files (335 MB)
were created. The new block files and mapping table
file were added to the HTTP servers, and the clients
used the new OS image by rebooting with the new mapping
table file.


Download Sites for OS Circular 


Figure 9 shows the global location of OS Circular
download sites. We dispersed these sites across the
globe to avoid intercontinental connections and provide
reasonable latency for downloading block files in the
US, Europe, and North Asia. Most of these sites are
commercial Web hosting services with reasonable prices
and a high level of maintenance.


OS Circular has a name resolver which tries to
supply nearby mirror servers. The mirror servers are
given a unique sub-domain name which is registered on
our DNS, so each client performs a DNS lookup at our
central site. This DNS is operated by DNS-Balance [15],
which resolves the sub-domain name to the nearest
mirror server (by using routing information offered by
RADB.net, the Routing Assets Database). It is a
server-side alternative to “netselect” on the client.
Figure 10 shows an example of DNS-Balance. The degree
of accuracy depends on RADB.net, but intercontinental
download is almost always avoided.

DNS-Balance is also used for load balancing and
fault tolerance because it can replace a server when
the HTTP/1.1 keep-alive times out.


Performance 


We measured the boot time of OS Circular on an
IBM/Lenovo ThinkPad T60 with a 2 GHz Core2 Duo T7200
and 2 GB of memory. The network latency was synthesized
with “dummynet” to monitor its effect. The synthesized
network delay was tested at both 0 and 30 msec to
emulate LAN and ISP environments. We also checked the
effect of ext2optimizer and DLAHEAD, which aim to
reduce total traffic and the effect of network latency.


Figure 8: Security update of Trusted HTTP-FUSE
CLOOP on OS Circular. [Diagram labels: “Mapping
Table”, “New mapping table file”, “Cache files at local
storage”, “Security update”, “Full virtualized VM”.]

          Release     Pkgs   Block Files   Size
Orig.     07Dec2006   884    14,523        1.9 GB
Update    11Dec2006   104    3,430         335 MB

Table 1: The difference of updated blocks.

Figure 9: Download sites for OS circular.

Figure 10: Searching the nearest download site by
DNS-Balance.


We compared the boot time on Xen-HVM and KQEMU.
Xen-HVM requires the CPU virtualization extension but
KQEMU runs on any normal IA32 CPU. The boot time of
Debian Etch on OS Circular was measured until gdm
(GNOME Display Manager) appeared.


Figures 11 and 12 show the results on Xen-HVM 
and KQEMU. The three lines show the case of no op- 
timization, ext2optimizer, and ext2optimizer coupled 
with DLAHEAD. Each line’s statistics commenced 
when GRUB read the kernel and initrd. “Boot” ends 
when the gdm appears. 


Xen-HVM vs. KQEMU 


The differences in boot times between Xen-HVM and
KQEMU were not large. In general, Xen-HVM is faster
than KQEMU, perhaps twice as fast when the latency is 0
(with no optimization). This advantage was reduced,
though, when both ext2optimizer and DLAHEAD were
applied. For 30 msec latency, the difference was small
and the boot time mostly depended on these
optimizations. The results showed the importance of the
access speed of the virtual disk.


Magnitude of Traffic 


The original block device was a 3 GB ext3
formatted disk and included 2 GB of Debian packages.
The results show that the blocks accessed at boot time
are scattered throughout the original filesystem. The
ext2optimizer solved this problem.


Without optimization, downloaded disk (network)
traffic measured 68 MB on Xen-HVM and 58 MB on KQEMU.
The difference was caused by the differing CPU
architectures (Core2 Duo on Xen-HVM and Pentium II on
KQEMU) and by the BIOS. The BIOS of Xen-HVM is not a
fully supported function of QEMU.


We created a disk access profile for Xen-HVM,
applied ext2optimizer, and recreated the block files.
This reduced the downloaded block files for Xen-HVM to
40% of the original (from 68 MB to 27 MB) and for KQEMU
to 47% of the original (from 58 MB to 27 MB).
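
The percentages follow directly from the measured traffic. A
short check, using a hypothetical helper name and the
figures quoted above:

```python
def remaining_percent(before_mb, after_mb):
    """Traffic after optimization as a percentage of the original."""
    return 100.0 * after_mb / before_mb


xen_hvm = remaining_percent(68, 27)   # just under 40
kqemu = remaining_percent(58, 27)     # between 46 and 47
```

Rounded to whole percent, these give the 40% and 47% figures
reported for Xen-HVM and KQEMU.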


DLAHEAD downloads soon-to-be-accessed block files
with four parallel downloading connections. DLAHEAD
usually downloads block files before they are requested
but sometimes it can’t keep up. When that happens, some
block files are downloaded twice, increasing the
traffic. Figures 11 and 12 show that these effects were
small, less than 30 MB. The network latency reduction
techniques were effective and enabled a quick boot.

Boot Time

Tables 2 and 3 show the boot times for Xen-HVM and
KQEMU. For comparison, we added the boot time of Debian
installed normally on USB memory (normal) and the boot
time from block files cached on USB memory (cache).

The difference between normal and cached booting
is the effect of compressed block files. In Tables 2
and 3, the difference didn’t appear because most of the
time was consumed by the boot sequence itself.

Figure 11: Amount of traffic (left) and throughput (right) at boot time on Xen-HVM under no latency (upper) and
30 msec latency (lower).

Applying ext2optimizer improved boot times in all
cases due to the reduction in downloaded block files.
This effect grew as network latency increased. The
ratio of the boot time with ext2optimizer to that with
no optimization approaches the reduction of traffic (to
40% and 47% on Xen-HVM and KQEMU).


When block files were downloaded from the net- 
work, DLAHEAD became available. DLAHEAD starts 
even before a guest OS boots (usually by about five 
seconds), as soon as the virtual disk is set up for a virtu- 
al machine. 


If DLAHEAD is not available, the tables show 
HTTP-FUSE CLOOP’s performance is quite vulnerable 
to network latency. The boot times with 30 msec delay 
were more than 2x slower than 0 msec latency. The 
combination of ext2optimizer and DLAHEAD reduced 
the boot times to less than 41 seconds in each case — 
close to the case of cached optimized block files. These 
two types of optimization are necessary for reasonable 
performance of HTTP-FUSE CLOOP. 


Throughput 


Since download requests are usually sequential 
and the size of each download is small (average 
100KB), network latency can dramatically reduce a 
system’s performance. 
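
A back-of-the-envelope model shows why latency matters so
much for small sequential downloads. The model and its
function name are illustrative only: it assumes each block
file is requested after the previous one has arrived, that
every request pays the full latency once, that 100 KB means
100,000 bytes, and that the link runs at 100 Mbps (the link
speed is not stated in the text).

```python
def sequential_throughput_mbps(block_bytes, latency_s, link_mbps):
    """Effective throughput when blocks are fetched one at a time."""
    bits = block_bytes * 8
    transfer_s = bits / (link_mbps * 1e6)  # time on the wire
    return bits / (latency_s + transfer_s) / 1e6


no_delay = sequential_throughput_mbps(100_000, 0.0, 100)
with_delay = sequential_throughput_mbps(100_000, 0.030, 100)
```

Under these assumptions a 30 msec delay per request cuts the
effective rate from the full link speed to roughly 21 Mbps,
which is why overlapping requests pays off.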


DLAHEAD increased the network throughput to about
90 Mbps on both Xen-HVM and KQEMU when the network
latency was 0 msec, finishing the download in five
seconds. At 0 msec latency, though, the resulting
improvement was limited. When the network latency was
30 msec, throughput was only 17 Mbps. However, the
download finished in 25 seconds and the boot times were
almost the same as in the cases of 0 msec network
latency. These results showed the power of DLAHEAD for
Internet-based servers.


Discussion 


Linking Vulnerability Databases 


OS Circular offers a framework for Internet
clients. The disk image is updated by the package
manager (though there is no way to authenticate it).
The disk image should be linked to vulnerability
databases to check the contents.


The target vulnerability database is CVE (Common
Vulnerabilities and Exposures) [16], operated by the
MITRE Corporation, which includes vulnerability
information for packages in each OS. OVAL (Open
Vulnerability Assessment Language) provides a uniform
mechanism to report on and control security.


We will apply CVE and offer the information for 
anonymous users to aid OS selection decisions. 


Trusted Boot to Detect Rootkit 


The current implementation has to trust
VMKNOPPIX. If VMKNOPPIX contains a rootkit or malware
on the virtual machine, we have no way to detect it
[17, 18].

The Trusted Computing Group (TCG) [19] pro- 
motes open standards for hardware-enabled trusted 
computing. It has released specifications for the Trust- 
ed Platform Module (TPM) chip which is often avail- 
able on current PCs and is used for trusted boot. 


We plan to integrate trusted boot (e.g., Trusted
GRUB [20] and the IMA Linux kernel [21, 22]) into
VMKNOPPIX. Trusted GRUB and IMA (Integrity Measurement
Architecture) keep a log of equipped devices and opened
files in the TPM. The log is sealed by the key of the
TPM and sent to a remote attestation server to certify
the trusted boot.


Live Update 


Some researchers have proposed live update of 
OSes on virtual machines, e.g., Intel vPro and LUCOS 
[23]. Virtual machines and the Guest OS communicate 
with each other and enable live update for security. 
Currently, OS Circular has no function for live update, 
but we hope to integrate it in the future. 

PlayStation 3

OS Circular has also been applied to other devices
which run Linux. We have even applied OS Circular to a
game machine, “PlayStation 3”; we call it HTTP-FUSE PS3
Linux [24].

The most important features are the unified device
model and preparation of the block device before
booting. The device model of PlayStation 3 is unified,
and the boot loader, “kboot” [25], is stored in a 4 MB
built-in Flash memory. kboot can download a kernel and
a miniroot via HTTP from the Internet. This “miniroot”
includes the driver for HTTP-FUSE CLOOP, and the root
filesystem is obtained by HTTP-FUSE CLOOP.

Table 2: Boot time on Xen-HVM (sec).

Table 3: Boot time on KQEMU (sec).


Conclusions 


We have proposed OS Circular, a client-centric 
OS migration system. OS Circular utilizes easily ob- 
tained infrastructure, including Web hosting and secu- 
rity update service. OS images are maintained by a se- 
curity management paradigm on the client, checking 
the validity of data blocks as they arrive. Performance 
improvements have enabled OS Circular to achieve 
performance almost as good as local disks. Future 
plans include the integration of Trusted Boot and
Vulnerability Database checking.


ABSTRACT


Existing applications often contain security holes that are not patched until after the system 
has already been compromised. Even when software updates are available, applying them often re- 
sults in system services being unavailable for some time. This can force administrators to leave 
system services in an insecure state for extended periods. To address these system security issues, 
we have developed the PeaPod virtualization layer. The PeaPod virtualization layer provides a 
group of processes and associated users with two virtualization abstractions, pods and peas. A pod 
provides an isolated virtualized environment that is decoupled from the underlying operating sys- 
tem instance. A pea provides an easy-to-use least privilege model for fine grain isolation amongst 
application components that need to interact with one another. As a result, the system easily en- 
ables the creation of lightweight environments for privileged program execution that can help with 
intrusion prevention and containment. Our measurements on real world desktop and server
applications demonstrate that the PeaPod virtualization layer imposes little overhead and enables secure
isolation of untrusted applications.


Introduction 


Security problems can wreak havoc on an orga- 
nization’s computing infrastructure. To prevent this, 
software vendors frequently release patches that can 
be applied to address security issues that have been 
discovered. However, software patches need to be ap- 
plied to be effective. It is not uncommon for systems 
to continue running unpatched applications long after 
a security exploit has become well-known [25]. This is 
especially true of the growing number of server appli- 
ances intended for very low-maintenance operation by 
less-skilled users. Furthermore, by reverse engineering 
security patches, attackers have been able to release 
exploits less than a month after the vulnerability is 
patched [16]. 


This impacts system administrators, as even with 
security patches being released, one cannot always ap- 
ply them in a timely manner. First, many security 
patches require that the system service being patched be 
taken off-line, thereby making it unavailable. Patching 
an operating system can result in the entire system hav- 
ing to be down for some period of time. If a system ad- 
ministrator chooses to fix an operating system security 
problem immediately, he risks upsetting his users be- 
cause of loss of data. Therefore, a system administrator 
must schedule downtime in advance and in cooperation 
with all the users, leaving the computer vulnerable until 
repaired. Furthermore, the release of a security patch
does not guarantee that it will apply successfully to
one’s system. If the system service is patched successfully,
the system downtime may be limited to just a
few minutes during the reboot. However, if the patch
is not successful, downtime can extend for many hours
while the problem is diagnosed and a solution is
found. Therefore, a system administrator will have to
delay applying the security patch until it is clear that
it will cause only a minimum amount of downtime.


Second, many system services in use today are 
supplied as appliances. Just like one’s physical appli- 
ances are simple single task machines, computing ap- 
pliances, be they commercial appliances, such as a 
TiVo or a NetApp Filer, or a simplified appliance a 
corporation deploys internally, such as a web or mail 
server appliance, are simplified single task systems. 
A primary advantage of computing appliances is that 
they can be deployed very easily by less-skilled 
users. However, this can result in them being set up, 
left running, and forgotten about since they “just
work.” As with all software systems, they will suffer
from bugs, some of which can have large security
implications. Since these appliances are meant to be put
into use by people who are not skilled in system
administration, one can end up deploying a large number
of systems that are vulnerable to being taken over
and used maliciously, without the owner of the
appliance having any knowledge that this has occurred.
Today, personal machines in active use are being
taken over and used as part of large botnets
without the knowledge of their owners [6].
In a future where large numbers of computing
appliances are deployed, this problem will
become significantly worse.


There are many principles that are used to in- 
crease the security of a software system and limit the 
damage that can occur if security is breached [26]. One 
of the most important is ensuring that one operates in a
Least Privilege environment. A Least Privilege environment
requires that a user or a program only have access
to the resources that are required to complete its
job. Even if the user’s or service’s environment is
exploited, the attacker will be constrained. For a system
with many distinct users and uses, designing a least 
privilege system can prove to be very difficult as many 
independent application systems can be used in many 
different and unknown ways. On the other hand, secur- 
ing a single service, such as a software appliance, is 
more tractable due to the limited nature of what the ser- 
vice accesses. 


A common approach to providing a least privilege
environment to a single service is a sandbox container
environment. Many sandbox container environments
have been developed to isolate untrusted applications.
However, many of these approaches have suffered
from being too complex and too difficult to configure
to use in practice, and have often been limited by an
inability to work seamlessly with existing system tools
and applications. Virtual machine monitors (VMMs)
offer a more attractive approach by providing a much 
easier-to-use isolation model of virtual machines, which 
look like separate and independent systems apart from 
the underlying host system. However, because VMMs 
need to run an entire operating system instance in each 
virtual machine, the granularity of isolation is very 
coarse, enabling malicious code in a virtual machine
to make use of the entire set of operating system
resources. Multiple operating system instances also need to be
maintained, adding administrative overhead.


A primary problem with a sandbox container that 
attempts to isolate a single service is that many ser- 
vices are composed of many interdependent programs. 
Each individual application that makes up the service
has its own set of requirements. However, since they
will all be run within the same sandbox container, 
each individual application will end up with access to 
the superset of resources that are needed by all the 
programs that make up the service, thereby negating 
the least privilege principle. One cannot divide the 
programs into distinct sandbox container environ- 
ments since many programs are interdependent and 
expect to work from within a single context. 


We present PeaPod, a virtualization layer that 
provides an easy-to-use abstraction that can be used at 
the granularity of individual applications. The PeaPod 
virtualization layer provides virtual machine isolation 
without the need to run multiple operating system in- 
stances. PeaPod further enables fine-grain isolation 
among application components that may need to inter- 
act within a single machine environment. PeaPod pro- 
vides its functionality without modifying, recompiling, 
or relinking applications or operating system kernels. 


PeaPod combines two key virtualization abstractions
in its virtualization layer. First, it leverages the
pod (PrOcess Domain) [20, 22] to provide a sandbox
container for entire services to run within. A pod is a
lightweight environment that mirrors the underlying
operating system environment. PeaPod isolates processes
in pods from the underlying system by associating virtual
identifiers with operating system resources and only
allowing access to resources that are made available
within the pod virtualized namespace. Since the pod
virtualization layer provides a virtual machine-like
environment, it also defines its own set of users, which
can be distinct from those supported by the underlying
system. Since it does not run an operating system
instance, a pod prevents malicious code from making use
of an entire set of operating system resources. Second,
it introduces peas (Protection and Encapsulation
Abstraction). A pea is an easy-to-use least privilege
mechanism that enables further isolation among application
components that need to share limited system resources
within a pod. It can prevent compromised application
components from attacking other components within
the same pod. A pea provides a simple resource-based
model that restricts access to other processes, IPC, file
system, and network resources within a pod.


Potter, Nieh, & Selsky


PeaPod improves upon previous approaches by
not requiring any operating system modifications, as
well as avoiding the time of check, time of use race
conditions that affect many of them [31]. Other
approaches perform file system security checks at the
system call level and therefore do not check the actual
file system object that the operating system uses.
PeaPod instead leverages a stackable file system to
integrate directly into the kernel’s file system
security framework, so that all file system security
checks are performed within the regular file system
security paths and on the same file system objects that
the kernel itself uses.


This paper describes how the PeaPod system can
isolate applications to limit their ability to attack a
system. The next section describes PeaPod’s
virtualization abstractions in further detail, followed by the
virtualization architecture that supports them. The next
two sections provide a security analysis of the PeaPod
system as well as examples of how to use PeaPod.
We then present experimental results that evaluate the
overhead associated with PeaPod and measure the system
performance of providing secure isolation for several
application scenarios, followed by related
work. Finally, we present some concluding remarks.


PeaPod Model 


The PeaPod model is based on a virtualization 
abstraction called a pod (PrOcess Domain). A pod 
looks just like a regular machine and provides the 
same application interface as the underlying operating 
system. Pods can be used to run any application, privi- 
leged or otherwise, without modifying, recompiling, 
or relinking applications. This is essential for both
ease-of-use and protection of the underlying system,
since applications not executing in a pod offer an
opportunity to attack the system. Processes within a pod
can make use of all available operating system
services, just like processes executing in a traditional
operating system environment.


A pod does not run an operating system instance.
Instead, it provides a virtualized machine environment
by providing a host-independent virtualized view of 
the underlying host operating system. This is done by 
providing each pod with its own private, virtual 
namespace. All operating system resources are only 
accessible to processes within a pod through the pod’s 
private, virtual namespace. 


A pod namespace is private in that only processes
within the pod can see the namespace. It is also private
in that it masks out resources that are not contained
within the pod. Processes inside a pod appear to one
another as normal processes that can communicate us- 
ing traditional IPC mechanisms. Other processes out- 
side a pod do not appear in the namespace and are 
therefore not able to interact with processes inside a 
pod using IPC mechanisms such as shared memory or 
signals. Instead, processes outside the pod can only in- 
teract with processes inside the pod using network 
communication and shared files that are normally used 
to support process communication across machines. 


A pod namespace is virtual in that all operating 
system resources including processes, user informa- 
tion, files, and devices are accessed through virtual 
identifiers within a pod. These virtual identifiers are 
distinct from host-dependent resource identifiers used 
by the operating system. The pod virtual namespace 
provides a host-independent view of the system by us- 
ing virtual identifiers that remain consistent through- 
out the life of a process in the pod, regardless of 
whether the pod moves from one system to another. 


The pod private, virtual namespace enables se- 
cure isolation of applications by providing complete 
mediation to operating system resources. Pods can re- 
strict what operating system resources are accessible 
within a pod by simply not providing identifiers to 
such resources within its namespace. A pod only 
needs to provide access to resources that are needed 
for running those processes within the pod. It does not 
need to provide access to all resources to support a 
complete operating system environment. An admini- 
strator can configure a pod in the same way one con- 
figures and installs applications on a regular machine. 


For example, if one had a web server that just
serves static content, one can easily set up a web server
pod to only contain the files the web server needs to
run and the content it wants to serve. The web server
pod could have its own IP address, decoupling its
network presence from the underlying system. It could
also limit network access to client-initiated connections.
If the web server application gets compromised, the
pod limits the ability of an attacker to further harm the
system since the only resources the attacker has access to are
the ones explicitly needed by the service. Furthermore,
there is no need to carefully disable other network
services commonly enabled by the operating system that
might be compromised within the pod since there is no
operating system running in the pod.


Pods can be used in conjunction with peas (Pro- 
tection and Encapsulation Abstraction). While pods 
separate processes into separate machine environ- 
ments, a pea can be used in a pod to provide fine- 
grain isolation among application components that 
may need to interact within a single machine envi- 
ronment, such as using interprocess communication 
mechanisms, including signals, shared memory, IPC 
messages and semaphores, and process forking and 
execution. 


A pea is an abstraction that can contain a group
of processes, restrict those processes from interacting
with processes outside of the pea, and limit their
access to only a subset of system resources. Unlike a
pod, which achieves isolation by controlling what re- 
sources are located within the namespace, a pea 
achieves isolation levels by controlling what system 
resources within a namespace its processes are al- 
lowed to access and interact with. For example, a 
process in a pea can see file system resources and pro- 
cesses available to other peas within a single pod, but 
can be restricted from accessing them. Unlike process- 
es in separate pods, processes in separate peas in a sin- 
gle pod share the same namespace and can be allowed 
to interact using traditional interprocess communica- 
tion mechanisms. Processes can also be allowed to 
move from one pea to another in the same pod.
However, by default a process in a pea cannot
access any resource that is not made available to its pea,
be it a process pid, IPC key or file system entry.


Peas can support a wide range of resource re- 
striction policies. By default, processes contained in a 
pea can only interact with other processes in the same 
pea. They have no access to other resources, such as 
file system and network resources or processes outside 
of the pea. This provides a set of fail safe defaults, as 
any extra access has to be explicitly allowed by the 
administrator. 


The pea abstraction allows for processes running 
on the same system to have varying levels of isolation, 
by running in separate peas. Many peas can be used 
side by side to provide flexibility in implementing a 
least privilege system for programs that are composed 
of multiple components that must work together, but 
do not all need the same level of privilege. One usage
scenario would be to run a privileged process in a
severely resource-limited pea, while still allowing
the process to use traditional UNIX semantics to work
with less privileged programs that run in less
resource-restricted peas.


For example, peas can be used to allow a web
server appliance the ability to serve dynamic content
via CGI in a more secure manner. Since the web
server and the CGI scripts need separate levels of
privilege, as well as different resource requirements, they
should not have to run within the same security
context. By configuring two separate peas for a web
service, one for the web server to run within and a
separate one for the specific CGI programs it wants to execute,
one limits the damage that can occur if a fault is
discovered within the web server. If an attacker manages to
execute malicious code within the context of the web
server, the attacker can only make use of resources that are
allocated to the web server’s pea, and can only
execute the specific programs that are needed as CGIs.
Since the CGI programs will also only run within their
specific security context, the ability for malicious code
to do harm is severely limited.


Peas and pods together provide secure isolation 
based on flexible resource restriction for programs as 
opposed to restricting access based on users. Peas and 
pods also do not subvert underlying system restric- 
tions based on user permissions, but instead comple- 
ment such models by offering additional resource con- 
trol based on the environment in which a program is 
executed. Instead of allowing programs with root priv- 
ileges to do anything they want to a system, PeaPod 
enables a system to control the execution of such pro- 
grams to limit their ability to harm a system even if 
they are compromised. 


PeaPod Virtualization 


To support PeaPod’s virtualization abstractions
of secure and isolated namespaces on commodity
operating systems, we employ a virtualization
architecture that operates between applications and the
operating system, without requiring any changes to
applications or the operating system kernel. This thin
virtualization layer is used to translate between the 
PeaPod namespaces and the underlying host operating 
system namespace. It protects the host operating sys- 
tem from dangerous privileged operations that might 
be performed by processes within the PeaPod, as well 
as protecting those processes from processes outside 
of the PeaPod on the host. It also enables program- 
based resource restriction for file access, device ac- 
cess, network access, root privileges, process interac- 
tions, and process transitions among peas. 


Pod Virtualization 


Pods are supported using virtualization mecha- 
nisms that translate between pod virtual resource iden- 
tifiers and operating system resource identifiers. Every 
resource that a process in a pod accesses is through a 
virtual name, which corresponds to an operating sys- 
tem resource identified by a physical name. When an 
operating system resource is created for a process in a 
pod, such as with process or IPC key creation, instead 
of returning the corresponding physical name to the 
process, the pod virtualization layer catches the
physical name value, creates a shadow identifier with a
private virtual name that maps to the physical name,
and returns the private virtual name to the process.
Similarly, any time a process passes a virtual name to 
the operating system, the virtualization layer catches 
it, and replaces it with the appropriate physical name. 
The key pod virtualization mechanisms used are a sys- 
tem call interposition mechanism and the chroot utility 
with file system stacking to provide each pod with its 
own file system namespace that can be separate from 
the regular host file system. 
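

The translation between virtual and physical names can be pictured with a
small user-space model. The following is only a sketch under stated
assumptions: the class PodNamespace and its method names are hypothetical
and are not taken from the PeaPod implementation, which performs this
translation inside its system call interposition layer. The sketch shows
that each pod hands out its own private virtual identifiers and that a
name which was never registered in a pod does not resolve there.

```python
from itertools import count


class PodNamespace:
    """Toy model of one pod's private process-ID namespace.

    All names here are hypothetical; this only illustrates the
    virtual-name <-> physical-name mapping described in the text.
    """

    def __init__(self):
        self._next_virtual = count(1)   # virtual names are private to this pod
        self._virt_to_phys = {}
        self._phys_to_virt = {}

    def on_create(self, physical_pid):
        """The OS returned a physical name: record a shadow identifier and
        hand the process a private virtual name instead."""
        virtual_pid = next(self._next_virtual)
        self._virt_to_phys[virtual_pid] = physical_pid
        self._phys_to_virt[physical_pid] = virtual_pid
        return virtual_pid

    def to_physical(self, virtual_pid):
        """A process passed a virtual name to the OS: replace it with the
        physical name. Names outside the pod do not resolve."""
        if virtual_pid not in self._virt_to_phys:
            raise ProcessLookupError(f"no process {virtual_pid} in this pod")
        return self._virt_to_phys[virtual_pid]

    def to_virtual(self, physical_pid):
        """Translate a physical name before returning it to a pod process."""
        return self._phys_to_virt[physical_pid]
```

Two PodNamespace objects can both hand out virtual PID 1 for different
physical processes, which mirrors how pods remain independent of the
host's identifiers.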


Pods can take advantage of the regular user iden- 
tifier (UID) security model to support multiple securi- 
ty domains on the same system running on the same 
operating system kernel. For example, since each pod 
can have its own private file system, each pod can 
have its own /etc/passwd file that determines its list of 
users and their corresponding UIDs. Since the pod file 
system is separate from the host file system, a process 
running in the pod is effectively running in a separate 
security domain from another process with the same 
UID that is running directly on the host system. Al- 
though both processes have the same UID, each 
process is only allowed to access files in its own file 
system namespace. Similarly, multiple pods can have 
processes running on the same system with the same 
UID, but each pod effectively provides a separate se- 
curity domain since the pod file systems are separate 
from one another. Since each pod provides a separate 
security domain, it needs to be viewed as if it is a dis- 
tinct machine. For instance, if two physical machines 
share a writable file system, an attacker could leverage 
flaws in one machine to get programs on the shared 
file system that can be used to exploit the second one. 
While there is value in sharing file system data be- 
tween pods, one has to use the same care in verifying 
the shared file system data with multiple pods as one 
would with multiple independent machines. 


Because the root UID 0 is privileged and treated 
specially by the operating system kernel, pod
virtualization also treats UID 0 processes inside of a pod in a
special way to prevent them from breaking the pod ab- 
straction, accessing resources outside of the pod, and 
causing harm to the host system. While a pod can be 
configured for administrative reasons to allow full 
privileged access to the underlying system, we focus 
on the case of pods for running application services 
that do not need to be used in this manner. Pods do not 
disallow UID 0 processes, which would limit the 
range of application services that could be run inside 
pods. Instead, pods provide restrictions on such pro- 
cesses to ensure that they function correctly inside of 
pods [22]. 

While a process is running in user space, the 
UID it runs as does not have any effect. Its UID only 
matters when it tries to access the underlying kernel 
via one of the kernel entry points, namely devices and 
system calls. Since a pod already provides a virtual 
file system that includes a virtual /dev with a limited
set of secure devices, the device entry point is already
secured. The only system calls of concern are those 
that could allow a root process to break the pod ab- 
straction. Only a small number of system calls can be 
used for this purpose [22]. 


Pea Virtualization 


Peas are supported using virtualization mechanisms
that label resources and enforce a simple set of
configurable permission rules to impose levels of
isolation among processes running within a single pod. For
example, when a process is created via the fork() and
clone() system calls, its process identifier is tagged
with the identifier of the pea in which it was created.
Peas leverage the pod’s shadow process identifier
and also place the new process in the same pea as its parent process.
A process’s ability to access pod resources is then
dictated by the set of access permission rules associated
with its pea. Like pod virtualization, the key pea
virtualization mechanisms used are a system call
interposition mechanism and file system stacking for file
system resources.


Pea virtualization employs system call interposi- 
tion to wrap existing system calls to enforce restric- 
tions on process interactions by controlling access to 
process and IPC virtual identifiers. Since each re- 
source is labeled with the pea in which it was created, 
the system call interposition mechanism checks if the 
pea labels of the calling process and the resource to be 
accessed are the same. For example, if a process in
one pea tries to send a signal to another process
in a separate pea by using the kill system call, the
system returns an error value of EPERM, as the
target process exists, but the caller has no permission to
signal it. On the other hand, a parent is able to use the
wait system call to clean up a terminated child process’s 
state, even if that child process is running within a sepa- 
rate pea since wait does not modify a process by affect- 
ing its execution. This is analogous to a regular user be- 
ing able to list the meta data of a file, such as owner 
and permission bits, even if the user has no permission 
to read from or write to the file. 
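

The label comparison can be sketched in a few lines. This is an
illustrative user-space model, not the PeaPod kernel module: the function
names interposed_kill and interposed_wait and the tables pea_of and
parent_of are hypothetical, and the wrappers return negative errno values
only to mimic a system call convention.

```python
import errno


def interposed_kill(caller_pid, target_pid, pea_of):
    """Return 0 if the signal may be delivered, else a negative errno.

    pea_of maps each virtual PID in the pod to the label of the pea in
    which that process was created (a hypothetical table).
    """
    if target_pid not in pea_of:
        return -errno.ESRCH          # no such process in this pod
    if pea_of[caller_pid] != pea_of[target_pid]:
        return -errno.EPERM          # target exists, but is in another pea
    return 0


def interposed_wait(caller_pid, child_pid, parent_of):
    """wait() is allowed across peas: only parenthood is checked here."""
    if parent_of.get(child_pid) != caller_pid:
        return -errno.ECHILD
    return 0
```

With pea_of = {10: "web", 11: "web", 20: "cgi"} and parent_of = {20: 10},
process 10 may signal process 11 and may wait for its child 20, but an
attempt to signal 20 yields EPERM.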


When a new process is created, it executes in the
pea security domain of its parent. However, when the
process executes a new program, one may want the
new program to run in a different pea security domain.
Therefore, peas support a single
type of pea access transition rule that lets a pea
determine how a process can transition from its current pea
to another. This transition rule is specified by a
program filename and pea identifier. A pea is able to have
multiple pea access transition rules of this type. The
rule specifies that a process should be moved into the
pea specified by the pea identifier if it executes the
program specified by the given filename. This is
useful when it is desirable to have that new program
execution occur in an environment with different resource
restrictions. For example, an Apache web server
running in a pea may want to execute its CGI child
processes in a pea with different restrictions. Pea
transitioning is supported by interposing on the exec
system call and transitioning peas if the program to be
executed matches a pea access transition rule for the
current pea. Note that pea access transition rules are
one-way transitions that do not enable a process to return
to its previous pea unless its new pea explicitly
provides for such a transition.
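

A transition rule can be modeled as a lookup keyed by the current pea and
the program being executed. The sketch below is an assumption-laden
illustration: the function pea_after_exec, the pea names, and the file
paths are all hypothetical and do not reflect PeaPod's actual rule syntax.

```python
from typing import Dict, Tuple

# (current pea, program filename) -> destination pea
TransitionRules = Dict[Tuple[str, str], str]


def pea_after_exec(current_pea: str, program: str,
                   rules: TransitionRules) -> str:
    """Return the pea a process runs in after exec'ing `program`.

    If the current pea has a rule for this program, the process moves to
    the rule's destination pea; otherwise it stays where it is.
    """
    return rules.get((current_pea, program), current_pea)


# Hypothetical rule set: the web server's pea sends one CGI program
# into a separate "cgi" pea. The "cgi" pea defines no rules of its own.
EXAMPLE_RULES: TransitionRules = {
    ("web", "/var/www/cgi-bin/search"): "cgi",
}
```

Because the "cgi" pea has no rule leading back to "web", a process that
has moved there stays there regardless of what it executes, which is the
one-way property described above.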


System call interposition is also used to control 
network access for processes inside the pea. Peas pro- 
vide two networking access rule types, one to allow 
processes in the pea to make outgoing network con- 
nections on a pod’s virtual network adapters, the other 
to allow processes in the pea to bind to specific ports 
on the adapter to receive incoming connections. Pea 
network access rules can allow complete access to a 
pod network adapter, or only allow access on a per 
port basis. Since any network access occurs through
system calls, peas simply check the arguments of the
networking system call, such as bind and connect, to
ensure that the calling process is allowed to perform the specified action.
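

The two network rule types can be sketched as a pair of port sets that
are consulted before a connect or bind is allowed to proceed. This is a
minimal model with hypothetical names (PeaNetRules, check_connect,
check_bind); it assumes that a pea with no rules has no network access,
in line with the fail-safe defaults described earlier.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PeaNetRules:
    """Per-pea network rules (hypothetical representation).

    For each rule type, None means complete access to the pod's adapter,
    and a set lists the individual ports that are allowed. The default,
    an empty set, allows nothing.
    """
    connect_ports: Optional[Set[int]] = field(default_factory=set)
    bind_ports: Optional[Set[int]] = field(default_factory=set)


def _allowed(ports: Optional[Set[int]], port: int) -> bool:
    return ports is None or port in ports


def check_connect(rules: PeaNetRules, port: int) -> bool:
    """Would an outgoing connection to `port` be permitted?"""
    return _allowed(rules.connect_ports, port)


def check_bind(rules: PeaNetRules, port: int) -> bool:
    """Would binding to `port` for incoming connections be permitted?"""
    return _allowed(rules.bind_ports, port)
```

For instance, PeaNetRules(bind_ports={80}) describes a pea whose
processes may accept connections on port 80 but may neither bind to any
other port nor open outgoing connections.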


Pea virtualization employs a set of file system
access rules and file system stacking to provide each
pea with its own permission set on top of the pod file 
system. To provide a least privilege environment, pro- 
cesses should not have access to file system privileges 
they do not need. For example, while Sendmail has to 
write to /var/spool/mqueue, it only has to read its con- 
figuration from /etc/mail and should not need to have 
write permission on its configuration. To implement 
such a least privilege environment, peas enable files to 
be tagged with additional permissions that overlay the 
respective underlying file permissions. File system 
permissions determine access rights based on the user 
identity of the process while pea file permission rules 
determine access rights based on the pea context in 
which a process is executed. Each pea file permission 
rule can selectively allow or deny use of the underly- 
ing read, write and execute permissions of a file on a 
per pea basis. The underlying file permission is always 
enforced, but pea permissions can further restrict 
whether the underlying permission is allowed to take 
effect. The final permission is achieved by performing 
a bitwise AND operation on both the pea and file
system permissions. For example, if the pea permission
rule allowed for read and execute, the permission set
of r-x would be triplicated to r-xr-xr-x for the three
sets of UNIX permissions and the bitwise AND
operation would mask out any write permission that the
underlying file system allows. This prevents any process
in the pea from opening the file to modify it.
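

The masking step is ordinary bit arithmetic and can be worked through
directly. In the sketch below the function names triplicate and
effective_mode are hypothetical; only the triplication and the bitwise
AND come from the description above.

```python
import stat


def triplicate(pea_bits: int) -> int:
    """Expand a 3-bit rwx value (for example 0o5 for r-x) so that it
    covers the owner, group, and other permission sets."""
    return (pea_bits << 6) | (pea_bits << 3) | pea_bits


def effective_mode(file_mode: int, pea_bits: int) -> int:
    """Combine the underlying file permission bits with the pea rule.

    The pea rule can only remove permissions: a bit survives only if
    both the file system and the pea allow it.
    """
    return stat.S_IMODE(file_mode) & triplicate(pea_bits)
```

With a pea rule of r-x (0o5), a file whose mode is rwxr-xr-x (0o755) is
seen as r-xr-xr-x (0o555), and a file whose mode is rw-r--r-- (0o644) is
seen as r--r--r-- (0o444): the pea rule never adds the execute bit that
the underlying file lacks.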


Enforcing on-disk labeling of every single file,
such as supported through access control lists
provided by many modern file systems, is too inflexible if a
single underlying file system is going to be used for
multiple disparate pods and peas. Since each pea in
each pod might make use of similar underlying files
but have different permission schemes, storing the pea
permission data on disk is not feasible. Instead, peas
support the ability to dynamically label each file
within a pod’s file system based on two simple path
matching permission rules: path specific permission rules
and directory default permission rules. A path specific
permission matches an exact path on the file system.
For instance, if there is a path specific permission for
/home/user/file, only that file will be matched with the
appropriate permission set. On the other hand, if there
is a directory default permission for the directory
/home/user/, any file under that directory in the
directory tree can match it and inherit its permission set.


Given a set of path specific and directory default 
permissions for a pea, the algorithm for determining 
what permission matches to what path starts with the 
complete path and walks up the path to the root direc- 
tory until it finds a matching permission rule. The al- 
gorithm can be described in four simple steps: 

1. If the specific path has a path specific
   permission, return that permission set.

2. Otherwise, choose the path’s directory as the
   current directory to test.

3. If the directory being tested has a directory
   default permission, return that permission set.

4. Otherwise set its parent as the current directory
   to test and go back to step 3.


If there is no path specific permission, the closest
directory default permission to the specified path
becomes the permission set for that path. Since, by
default, peas give the root directory “/” a directory
default permission denying all permissions, the default
for every file on the system, unless otherwise specified,
is deny. This ensures that peas have a fail-safe default
setup and do not allow access to any files unless
specified by the administrator.
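

The four steps and the deny-all default on the root directory can be
written out as a short lookup routine. This is a sketch under stated
assumptions: the function lookup_permission, the dictionary-based rule
tables, and the encoding of a permission set as a 3-bit rwx integer are
illustrative choices, not PeaPod's data structures.

```python
import posixpath

DENY = 0o0  # no read, write, or execute


def lookup_permission(path, path_rules, dir_defaults):
    """Resolve the pea permission set for an absolute path.

    path_rules   maps exact paths to permission sets.
    dir_defaults maps directories (no trailing slash) to permission sets.
    """
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")

    # The root directory carries a deny-all default unless overridden.
    defaults = {"/": DENY, **dir_defaults}

    # Step 1: an exact path specific rule wins.
    if path in path_rules:
        return path_rules[path]

    # Step 2: start testing at the path's own directory.
    current = posixpath.dirname(path)

    # Steps 3 and 4: walk up until a directory default is found.
    while current not in defaults:
        parent = posixpath.dirname(current)
        if parent == current:      # cannot climb further (e.g. "//")
            return DENY
        current = parent
    return defaults[current]
```

Given a path specific rule for /home/user/file and a directory default
for /home/user, a lookup of /home/user/docs/a.txt walks up from
/home/user/docs to /home/user and inherits that default, while a lookup
of /etc/passwd reaches the root directory and is denied.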


The semantics of pea file permission are based
on file path name. If a file has more than one path
name, such as via a hard link, all of its path names
have to be protected by the same permission. Otherwise,
the permission set the file gets is determined simply
by whichever path name is used to access it first.
This issue only occurs when creating the initial set of
pea file access permissions. Once the pea is set up, any
hard links that are created will obey the regular file
system permissions. For instance, one is not allowed to
create a hard link to a file that one does not have
permission to access. On the other hand, if one has
permission to access the file, a path specific permission
rule will be created for the newly created path name that
corresponds to the permission of the
path name it was linked to.


The pea architecture makes use of the pod’s 
stackable file system to integrate the pea file system 
namespace restrictions into the regular kernel permis- 
sion model, thereby avoiding time of check, time of 
use race conditions. It accomplishes this by stacking
on top of the file system’s lookup function, which fills
in the respective file’s inode structure, and the permis- 
sion function, which makes use of the stored permis- 
sion data to make simple permission determinations. 
A file system’s permission function is a standard part 
of the operating system’s security infrastructure, so no 
kernel changes are necessary. 


Pea Configuration Rules 
File System 


Many system resources in UNIX, including nor- 
mal files, directories, and system devices, are accessed 
via files so controlling access to the file system is cru- 
cial. Each pea must be restricted to those files used by 
its component processes. This control is important for 
security, because processes that work together do not 
necessarily need the same access rights to files. All 
file system access is controlled by path specific and 
directory default rules, which specify a file or directo- 
ry and an access right, such as read, write, and exe- 
cute. 


The access values for file rules are read, write, 
execute, similar to standard UNIX permissions. For 
convenience, we also define allow and deny, which 
are aliases for all three of read, write, and execute and 
cannot be combined with other access values in the 
same rules. When a path specific or directory default 
rule gives access to a file, it implicitly gives execute, 
but not read or write, access to all parent directories of 
the file, up to the root directory. On the other hand, if a 
path specific rule denies access to a directory, then ac- 
cess to both the directory and the directory contents, 
including subdirectories and files, will be denied, even 
if a separate rule would give access to subdirectories 
or files due to it being the best match. 
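As a rough illustration of these access values, the sketch below parses the access field of a rule and lists the parent directories that would implicitly receive execute-only access. The function names are hypothetical, and treating deny as "grant nothing" is our reading of the text rather than a statement about PeaPod's internal representation.

```python
import posixpath

ALL_ACCESS = frozenset({"read", "write", "execute"})


def parse_access(spec):
    """Turn an access field such as "read,write" into a set of grants."""
    values = [v.strip() for v in spec.split(",")]
    if "allow" in values or "deny" in values:
        # allow and deny stand for all three values and must appear alone.
        if len(values) != 1:
            raise ValueError("allow and deny cannot be combined with other access values")
        return ALL_ACCESS if values[0] == "allow" else frozenset()
    unknown = set(values) - ALL_ACCESS
    if unknown:
        raise ValueError("unknown access value(s): " + ", ".join(sorted(unknown)))
    return frozenset(values)


def implied_parent_dirs(path):
    """List the directories, up to the root, that gain execute-only access."""
    parents = []
    current = posixpath.dirname(posixpath.normpath(path))
    while True:
        parents.append(current)
        if current == "/":
            break
        current = posixpath.dirname(current)
    return parents
```

For example, a rule granting read access to /etc/mail/aliases would imply execute-only access to /etc/mail, /etc, and the root directory.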


pod mailserver {
    pea sendmail {
        path /etc/mail/aliases read
        path /etc/mail/aliases.db read
    }
    pea newaliases {
        path /etc/mail/aliases read
        path /etc/mail/aliases.db read,write
    }
}


Rule 1: Example of Read/Write rules. 


Consider the case of the Sendmail mail daemon 
and the newaliases command with regard to the sys- 
tem-wide aliases file. Sendmail runs as the root user 
and needs to be able to read the aliases file in order to 
know to where it should forward mail or otherwise re- 
direct it. newaliases is a symbolic link to sendmail and 
typically also runs as the root user in order to update the 
aliases file and convert it into the database format used 
by the Sendmail daemon. In our example, newaliases
runs in its own pea and is able to read from
/etc/mail/aliases and read from and write to /etc/mail/aliases.db.
Meanwhile sendmail runs in another pea and is able
to read both files, but not write to them. We use two path
specific rules to express these access rules as described
in Rule 1.


pod music {
    pea play {
        path /dev/dsp write
    }
    pea rec {
        path /dev/dsp read
    }
}


Rule 2: Protecting a device. 


Similar rules can protect a device like /dev/dsp.
When a user logs into a system locally, via the console,
they are typically given control of local devices,
such as the physical display and the sound card. Any
application that the user runs has access to read from
and write to these local devices, even though this privilege
is not necessary. For example, we want to restrict
playing and recording of sound files to the play and rec
applications, which are part of SoX [27]. Rule 2 describes
the rules that provide the appropriate access to
the device.


The other file system rule is dir-default. It uses 
the same access values as path, but it is used to specify 
the default access for files below a directory. Any file 
or sub-directory will inherit the same access flags 
since access is determined by matching the longest 
possible path prefix. Unlike path specific rules, direc- 
tory default rules can deny access to a directory in 
general, while still allowing access to specific files. 
Rule 3 describes a pea that denies access to all files in 
/bin, while only allowing access to /bin/ls. 


pod fileLister {
    pea onlyLs {
        dir-default /bin deny
        path /bin/ls allow
    }
}


Rule 3: Directory default rule. 


Transition Rules 

In the Sendmail/Procmail use case, sendmail 
forks off and executes a procmail process to deliver the 
mail to the user’s spool. Procmail needs different se- 
curity settings, so it must transition from a Sendmail 
pea to a Procmail pea. Rules must be defined that state 
to which pea a process will transition upon execution. 
When a process calls the execve system call, we exam- 
ine the file name to be executed and perform a longest 
prefix match on all the transition rules. For instance, 
by specifying a directory for a transition, PeaPod will 
cause a pea transition to occur for any program exe- 
cuted that is located in that directory, unless there’s a 
more specific transition rule available. 
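The longest prefix match described above might look like the following sketch. The function name and the dictionary of transition rules are assumptions for illustration, and matching on whole path components (so that /usr/bin does not match /usr/binx) is our interpretation of "prefix".

```python
import posixpath


def find_transition(exec_path, transitions):
    """Return the target pea for an executed file, or None if no rule matches.

    transitions is a hypothetical dict mapping a file or directory path
    to the name of the pea a process should transition into.
    """
    exec_path = posixpath.normpath(exec_path)
    best = None
    for prefix, pea in transitions.items():
        prefix = posixpath.normpath(prefix)
        matches = (
            exec_path == prefix
            or prefix == "/"
            or exec_path.startswith(prefix + "/")
        )
        # Keep the longest matching prefix, i.e. the most specific rule.
        if matches and (best is None or len(prefix) > len(best[0])):
            best = (prefix, pea)
    return best[1] if best else None


# A directory rule plus a more specific rule for one program in it.
# The "admin" pea and admin.pl are made up for this example.
transitions = {
    "/var/www/cgi-bin": "cgi",
    "/var/www/cgi-bin/admin.pl": "admin",
}
```

Here any program executed from /var/www/cgi-bin transitions to the cgi pea, except admin.pl, for which the more specific rule selects a different pea.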


Rule 4 creates a pea for Sendmail and Procmail, 
and specifies that a process should transition when the 
procmail program is executed. 


Secure Isolation of Untrusted Legacy Applications 


pod mailserver {
    pea sendmail {
        transition /usr/bin/procmail procmail
    }
    pea procmail {
    }
}

Rule 4: Transition rules. 


PeaPod does not provide the ability for a process
to transition to another pea other than by executing a new
program. If it could, a process could open an allowed
file in one pea and then transition to another pea 
where access to that file was not allowed and thus cir- 
cumvent the security restrictions. 


Networking Rules 


PeaPod provides two rules that define the net- 
work capabilities a pea exposes to the processes run- 
ning within it. First, peas are able to restrict a process 
from instantiating an outgoing connection. Second, 
peas are able to limit what ports a process can bind to 
and listen for incoming connections. By default, peas 
do not let processes make any outgoing connections or 
bind to any port. While a full network firewall is an 
important part of any security architecture, it is or- 
thogonal to the goals of PeaPod and therefore belongs 
in its own security layer. 


Continuing the simplified Sendmail/Procmail us- 
age case, an administrator would want to easily con- 
fine the network presence of processes running within 
Sendmail/Procmail peas. By allowing sendmail to 
make outgoing connections, to enable it to send mes- 
sages, as well as bind to port 25, the standard port for 
receiving messages, Sendmail can continue to work 
normally. In contrast, processes run within the
procmail pea, which is otherwise less restricted, are not
allowed to bind to any port. They are, however,
allowed to initiate outgoing network connections.
This allows programs, such as spam filters that
require checking network based information, to continue
to work.


pod mailserver {
    pea sendmail {
        outgoing allow
        bind tcp/25
    }
    pea procmail {
        outgoing allow
    }
}


Rule 5: Networking rules. 
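The default-deny behavior of the two networking rules can be modeled with a small policy object. This is a sketch under our own assumptions: the class and method names are hypothetical, and the real checks happen inside the virtualization layer rather than in user code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetPolicy:
    """Per-pea network policy; everything is denied unless a rule allows it."""
    outgoing: bool = False          # set by an "outgoing allow" rule
    binds: frozenset = frozenset()  # (protocol, port) pairs from "bind" rules

    def may_connect(self):
        return self.outgoing

    def may_bind(self, protocol, port):
        return (protocol, port) in self.binds


# The two peas of Rule 5, plus a pea with no networking rules at all.
sendmail = NetPolicy(outgoing=True, binds=frozenset({("tcp", 25)}))
procmail = NetPolicy(outgoing=True)
unconfigured = NetPolicy()
```

Under this model the sendmail pea may bind to TCP port 25 and connect out, the procmail pea may only connect out, and a pea without networking rules may do neither.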


Shared Namespace Rules 


PeaPod provides a single namespace rule for enabling
processes to access the pod’s virtual private
identifiers that do not belong to their own pea. PeaPod
enables peas to be configured to only have access to
resources tagged with specific pea identifiers or with
the special global pea identifier that enables access to
every virtual private resource in the pod. A common
usage of this rule is to enable the creation of a global
pea with access to all the resources of a pod, for instance
to enable a process to start up and shut down
services run within a resource restricted pea. Rule 6
describes a pod that has a global pea that is able to access
every private virtual identifier in the pod, as well
as a pea that is able to access the virtual identifiers that
belong to one of its sibling peas.


pod service {
    pea global_access {
        namespace global
    }
    pea test1 {
        namespace test2
    }
    pea test2 {
    }
}

Rule 6: Namespace access rules. 
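A minimal sketch of the visibility check implied by the namespace rule follows. It assumes that every virtual private resource is tagged with the name of the pea that created it; the function name and the dictionary layout are hypothetical, and the pea names test1 and test2 are modeled on Rule 6.

```python
GLOBAL = "global"


def can_access(resource_owner, pea, namespace_rules):
    """Decide whether a process in `pea` may use a resource tagged with
    `resource_owner`, the pea that created it.

    namespace_rules is a hypothetical dict mapping a pea name to the set
    of pea identifiers granted to it by namespace rules.
    """
    if resource_owner == pea:
        # A pea always sees its own resources.
        return True
    granted = namespace_rules.get(pea, set())
    return GLOBAL in granted or resource_owner in granted


# Grants corresponding to the three peas of Rule 6.
namespace_rules = {
    "global_access": {"global"},
    "test1": {"test2"},
}
```

The grant is one-directional: test1 can reach resources owned by test2, but test2, which has no namespace rule, cannot reach those of test1.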


Managing Rules 


To make it simpler for administrators to create 
peas in a pod, we allow groups of rules to be saved to 
a file and included in the main configuration file for a 
given PeaPod configuration. These groups of rules 
would typically describe the minimum resources nec- 
essary for a single application. Application packagers 
can include rule group files in their package and ad- 
ministrators can share rule groups with each other. 


path        /usr/bin/gcc         read,execute
dir-default /usr/lib/gcc-lib     read,execute
path        /usr/bin/cpp         read,execute
path        /usr/lib/libiberty.a read
path        /usr/bin/ar          read,execute
path        /usr/bin/as          read,execute
path        /usr/bin/ld          read,execute
path        /usr/bin/ranlib      read,execute
path        /usr/bin/strip       read,execute


Rule 7: Compiler rules. 


A rule group, such as Rule 7 for a compiler, 
would be stored in a central location. An administrator 
uses an include rule to reference the external file as 
part of a development PeaPod. Rule 8 contains the 
tools necessary to build a Linux kernel from source,
and permits access to the source code itself and a 
writable directory for the binaries. 


pod workstation {
    pea kernel-development {
        include "stdlibs"
        include "compiler"
        include "tar"
        include "bzip2"
        dir-default /usr/local/src/ read
        dir-default /scratch/binaries allow
    }
}


Rule 8: Set of multiple rule files. 
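One way to picture how include rules compose is as a textual expansion of rule groups into a flat rule list. The sketch below is an assumption-laden illustration: rule files are represented by an in-memory dictionary instead of files on disk, the function name is made up, and support for nested includes (with a check against cycles) is our own addition that the text does not confirm.

```python
def expand_rules(lines, rule_files, _seen=()):
    """Replace each include "name" line with the rules of that group.

    rule_files is a hypothetical dict mapping a rule group name to its
    list of rule lines, standing in for files stored in a central location.
    """
    expanded = []
    for line in lines:
        words = line.split()
        if len(words) == 2 and words[0] == "include":
            name = words[1].strip('"')
            if name in _seen:
                raise ValueError("circular include: " + name)
            expanded.extend(
                expand_rules(rule_files[name], rule_files, _seen + (name,))
            )
        elif words:
            # Keep ordinary rules, normalizing whitespace.
            expanded.append(" ".join(words))
    return expanded


# "binutils" is a made-up group used only to show a nested include.
rule_files = {
    "compiler": ["path /usr/bin/gcc read,execute", 'include "binutils"'],
    "binutils": ["path /usr/bin/ld read,execute"],
}
pea_rules = ['include "compiler"', "dir-default /usr/local/src/ read"]
```

Expanding pea_rules yields the compiler and linker path rules followed by the pea's own dir-default rule, which is the flat list a rule matcher would then consult.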


These management rules demonstrate PeaPod’s 
ability to distinguish the minimal needs of a program 


Potter, Nieh, & Selsky 


service in order to execute, while enabling an admini- 
strator to define a local policy that can restrict what lo- 
cal resources the program service has access to. The 
knowledge needed to build a set of rules for a program 
service that provides the minimal needed set of re- 
sources to execute is not always readily available to 
users of security systems. However, this knowledge is 
available to the authors and distributors of the system. 
PeaPod’s management rules enable the creation and 
distribution of rule files that define the minimal set of 
resources needed to execute a program service, while 
enabling the local administrator to further define the 
resources restriction policy. 


Security Analysis 


Saltzer and Schroeder [26] describe several prin- 
ciples for designing and building secure systems. 
These include: 

• Economy of mechanism: Simpler and smaller
  systems are easier to understand and ensure that
  they do not allow unwanted access.

• Fail safe defaults: Systems must choose when
  to allow access as opposed to choosing when to
  deny.

• Complete mediation: Systems should check every
  access to protected objects.

• Least privilege: A process should only have access
  to the privileges and resources it needs to
  do its job.

• Psychological acceptability: If users are not
  willing to accept the requirements that the security
  system imposes, such as very complex
  passwords that the users are forced to write
  down, security is impaired. Similarly, if using
  the system is too complicated, users will misconfigure
  it and end up leaving it wide open.

• Work factor: Security designs should force an
  attacker to have to do extra work to break the
  system. The classic quantifiable example is
  when one adds a single bit to an encryption key,
  one doubles the key space an attacker has to
  search.


PeaPod is designed to satisfy these six principles. 
PeaPod provides economy of mechanism using a thin 
virtualization layer based on system call interposition 
and file system stacking that only adds a modest amount 
of code to a running system. The largest part of the sys- 
tem is due to the use of a null stackable file system with 
7000 lines of C code, but this file system was generated 
using a simple high-level file system language [33], and 
only 50 lines of code were added to this well tested file 
system to implement the PeaPod file system security. 
Furthermore, PeaPod changes neither applications nor 
the underlying operating system kernel. The modest 
amount of code to implement PeaPod makes the system 
easier to understand. Since the PeaPod security model 
only provides resources that are explicitly stated, it is 
relatively easy to understand the security properties of 
resource access provided by the model.


PeaPod provides fail safe defaults by only providing
access to resources that have been explicitly
given to peas and pods. If a resource is not created 
within a pea, or explicitly made available to that pea, 
no process within that pea will be allowed to access it. 
While a pea can be configured to enable access to all 
resources of the pod, this is an explicit action an ad- 
ministrator has to take. 


PeaPod provides for complete mediation of all 
resources available on the host machine by ensuring 
that all resource accesses occur through the pod’s vir- 
tual namespace. Unless a file, process, or other operat- 
ing system resource was explicitly placed in the pod 
by the administrator or created within the pod, Pea- 
Pod’s virtualization will not allow a process within a 
pod to access the resource. 


PeaPod provides a least privilege environment
in two ways. First, pods provide a least privilege envi- 
ronment by enabling an administrator to only include 
the data necessary for each service. PeaPod can pro- 
vide separate pods for individual services so that sepa- 
rate services are isolated and restricted to the appropri- 
ate set of resources. Even if a service is exploited, 
PeaPod will limit the attacker to the resources the ad- 
ministrator provided for that service. While one can 
achieve similar isolation by running each individual 
service on a separate machine, this leads to inefficient 
use of resources. PeaPod maintains the same least 
privilege semantic of running individual services on 
separate machines, while making efficient use of ma- 
chine resources at hand. For instance, an administrator 
could run MySQL and Sendmail mail transfer services 
on a single machine, but within different pods. If the 
Sendmail pod gets exploited, the pod model ensures 
that the MySQL pod and its data will remain isolated 
from the attacker. Furthermore, PeaPod’s peas are ex- 
plicitly designed to enable least privileged environ- 
ments by restricting programs in an environment that 
can be easily limited to provide the least amount of ac- 
cess for the encapsulated program to do its job. 


PeaPod provides psychological acceptability by 
leveraging the knowledge and skills system administrators
already use to set up system environments. Because
pods provide a virtual machine model, administrators
can use their existing knowledge and skills to
run their services within pods. Furthermore, peas use a 
simple resource based model that does not require a 
detailed understanding of any underlying operating 
system specifics. This differs from other least privi- 
lege architectures that force an administrator to learn 
new principles or complicated configuration languages 
that require a detailed understanding of operating sys- 
tem principles. 

Similar to least privilege, PeaPod increases the
work factor that it would take to compromise a system
by simply not making available the resources that attackers
depend on to harm a system once they have
broken in. For example, since PeaPod can selectively
control which programs are visible to a process,
it would be very difficult to get a root shell on a
system that does not have access to any shell program.


Usage Examples 


We briefly describe three examples that help il- 
lustrate how the PeaPod virtualization layer can be 
used to improve computer security and application 
availability for different application scenarios. The ap- 
plication scenarios are e-mail delivery, web content 
delivery, and desktop computing. In the following ex- 
amples we make extensive use of PeaPod’s ability to 
compose rule files in order to simplify the rules. In- 
stead of listing every file and library necessary to exe- 
cute a program, we isolate them into a separate rule 
file to place the focus on the actual management of the 
service the pea is trying to protect. 


E-mail Delivery 


For e-mail delivery, PeaPod’s virtualization layer 
can isolate different components of e-mail delivery to 
provide a significantly higher level of security in light 
of the many attacks on Sendmail vulnerabilities that 
have occurred. Consider isolating a Sendmail installa- 
tion that also provides mail delivery and filtering via 
Procmail. E-mail delivery services are often run on the 
same system as other Internet services to improve re- 
source utilization and simplify system administration 
through server consolidation. However, this can pro- 
vide additional resources to services that do not need 
them, potentially increasing the damage that can be 
done to the system if attacked. 


pod mail-delivery {
    pea sendmail {
        include "stdlibs"
        include "sendmail"
        dir-default /etc read
        dir-default /var/spool/mqueue allow
        dir-default /var/spool/mail allow
        dir-default /var/run allow
        path /usr/bin/procmail read,execute
        transition /usr/bin/procmail procmail
        bind tcp/25
        outgoing allow
    }
    pea procmail {
        dir-default / allow
        outgoing allow
    }
}


Rule 9: E-Mail delivery configuration. 


As shown in Rule 9, using PeaPod’s virtualiza- 
tion layer, both Sendmail and Procmail can execute in 
the same pod, which isolates e-mail delivery from oth- 
er services on the system. Furthermore, Sendmail and 
Procmail can be placed in separate peas, which allows 
necessary interprocess communication mechanisms between
them while improving isolation. This pod is a
common example of a privileged service that has child
helper applications. In this case, the Sendmail pea is
configured with full network access to receive e-mail, 
but only with access to files necessary to read its con- 
figuration and to send and deliver email. Sendmail 
would be denied write access to file system areas such 
as /usr/bin to prevent modification to those executables, 
and would only be allowed to transition a process to 
the Procmail pea if it is executing Procmail, the only 
new program its pea allows it to execute. On mail de- 
livery, Sendmail would then exec Procmail, which 
transitions the process into the Procmail pea. The 
Procmail pea is configured with a more liberal access 
permission, namely allowing access to the pod’s entire 
file system, enabling it to run other programs, such as 
SpamAssassin. While an administrator could configure
programs Procmail executes, such as SpamAssassin,
to run within their own peas, this case keeps them
all within a single pea to demonstrate how simple a 
system can be. As a result, the Sendmail/Procmail pod 
can provide full e-mail delivery service while isolating 
Sendmail such that even if Sendmail is compromised 
by an attack, such as a buffer overflow, the attacker 
would be contained in the Sendmail pea and not even 
be able to execute processes, such as a root shell, to 
further compromise the system. 


Web Content Delivery 


For web content delivery, PeaPod’s virtualization 
layer can isolate different components of web content 
delivery to provide a significantly higher level of se- 
curity in light of common web server attacks that may 
exploit CGI script vulnerabilities. Consider isolating 
an Apache web server front end, a MySQL database 
back-end, and CGI scripts that interface between 
them. While one could run Apache and MySQL in 
separate pods, since they are providing a single ser- 
vice, it makes sense to run them within a single pod 
that is isolated from the rest of the system. However, 
since both Apache and MySQL are within the pod’s 
single namespace, if an exploit is discovered in Apache, 
it could be used to perform unauthorized modifications 
to the MySQL database. 


To provide greater isolation among different web 
content delivery components, Rule 10 describes a set 
of three peas in a pod: one for Apache, a second for 
MySQL, and a third for the CGI programs. Each pea 
is configured to contain the minimal set of resources 
needed by the processes running within the respective 
pea. The Apache pea includes the apache binary, con- 
figuration files and the static HTML content, as well 
as a transition permission to exec all CGI programs in- 
to the CGI pea. The CGI pea contains the relevant 
CGI programs as well as access to the MySQL dae- 
mon’s named socket, allowing interprocess communi- 
cation with the MySQL daemon to perform the rele- 
vant SQL queries. The MySQL pea contains the mysql 
daemon binary, configuration files and the files that 
make up the relevant databases. Since Apache is the 
only program exposed to the outside world, it is the
only process that can be directly exploited. However,
if an attacker is able to exploit it, the attacker is limit- 
ed to a pea that is only able to read or write specific 
Apache files, as well as exec specific CGI programs 
into a separate pea. Since the only way to access the 
database is through the CGI programs, the only access 
to the database an attacker would have is what is al- 
lowed by said programs. Consequently, the ability of 
an attacker to cause serious harm to such a web con- 
tent delivery system running with PeaPod’s virtualiza- 
tion layer is significantly reduced. 


pod web-delivery {
    pea apache {
        include "stdlibs"
        path /usr/sbin/apache read,execute
        path /usr/sbin/apachectl read,execute
        dir-default /var/www read,execute
        transition /var/www/cgi-bin cgi
        bind tcp/80
    }
    pea cgi {
        include "stdlibs"
        include "perl"
        dir-default /var/www/data allow
        path /tmp/mysql.sock allow
    }
    pea mysql {
        include "stdlibs"
        path /usr/sbin/mysqld read,execute
        path /tmp/mysql.sock allow
        dir-default /usr/share/mysql read
        dir-default /var/lib/mysql allow
    }
}


Rule 10: Web delivery rules. 


Desktop Computing 


For desktop computing, PeaPod’s virtualization 
layer enables desktop computing environments to run 
multiple desktops from different security domains 
within multiple pods. Peas can also be used within the 
context of such a desktop computing environment to 
provide additional isolation. Many applications used on
a daily basis, such as mp3 players and web browsers, 
have had security holes. These holes enable attackers 
to execute malicious code or gain access to the entire 
local file system [12, 13]. Rule 11 describes a set of 
PeaPod rules that are used to contain a small set of 
desktop applications being used by a user with the 
/home/spotter home directory. 


To secure an mp3 player, a pea can be created 
within the desktop computing pod that restricts the 
mp3 player’s ability to make use of files outside of a 
special mp3 directory. Since most users store their mu- 
sic within its own subtree, this isn’t a serious restric- 
tion. Most mp3 content should not be trusted, espe- 
cially if one is streaming mp3s from a remote site. By 
running the mp3 player within this fully restricted pea, 
a malicious mp3 cannot compromise the user’s desk- 
top session. This mp3 player pea is simply configured 
with four file system permissions. A path specific per- 
mission that provides access to the mp3 player itself is 


126 21st Large Installation System Administration Conference (LISA ’07) 


required to load the application. A directory default
permission that provides access to the entire mp3 directory
subtree is required to give the process access
to the mp3 file library. A directory default permission
to a directory meant to store temporary files is required so
the mp3 player can be used as a helper application. Finally,
a path specific permission that provides access to
the /dev/dsp audio device is required to allow the
process to play audio.


pod desktop {
    pea firefox {
        include "firefox"
        dir-default /home/spotter/.mozilla allow
        dir-default /home/spotter/tmp allow
        dir-default /home/spotter/download allow
        transition /usr/bin/mpg123 mpg123
        transition /usr/bin/acroread acroread
    }
    pea mpg123 {
        include "stdlibs"
        path /usr/bin/mpg123 read,execute
        path /dev/dsp write
        dir-default /home/spotter/tmp allow
        dir-default /home/spotter/music allow
    }
    pea acroread {
        include "stdlibs"
        include "acroread"
        dir-default /home/spotter/tmp allow
    }
}


Rule 11: Desktop application rules. 


To secure a web browser, a pea can be created 
within a desktop computing pod that restricts the web 
browser’s access to system resources. Consider the 
Mozilla Firefox web browser as an example. A Fire- 
fox pea would need to have all the files Firefox needs 
to run accessible from within the pea. Mozilla dynami- 
cally loads libraries and stores them along with its 
plugins within the /usr/lib/firefox directory. By providing 
a directory default permission that provides access to 
that directory, as well as another directory default per- 
mission that provides access to the user’s .mozilla di- 
rectory, the Firefox web browser can run as normal 
within this special Firefox pea. Users also want the 
ability to be able to download and save files, as well 
as launch viewers, such as for postscript or mp3 files, 
directly from the web browser. This involves a simple 
reconfiguration of Firefox to change its internal appli- 
cation.tmp_dir variable to be a directory that is writable 
within the Mozilla pea. By creating such a directory, 
such as download within the user’s home directory, and
providing a directory default permission allowing ac- 
cess, we enable one to explicitly save files, as well as 
implicitly save when one wants to execute a helper ap- 
plication. Similarly, just like Mozilla is configured to 
run helper applications for certain file types, one would 
have to configure the Mozilla pea to execute those 
helper applications within their respective peas. As 
shown for the mp3 player, configuring such a pea for
these processes is fairly simple. The only addition one
would have to make is to provide an additional pea
transition permission to the Mozilla pea that tells the 
PeaPod’s virtualization layer to transition the process to 
a separate pea on execution of programs such as the 
mpg123 mp3 player or the Acrobat Reader PDF viewer. 


Experimental Results 


We implemented PeaPod’s virtualization layer as 
a loadable kernel module in Linux that requires no 
changes to the Linux kernel source code or design. We 
present some experimental results using our Linux 
prototype to quantify the overhead of using PeaPod on 
various applications. Experiments were conducted on 
two IBM Netfinity 4500R machines, each with a 933
MHz Intel Pentium-III CPU, 512 MB RAM, 9.1 GB
SCSI HD and a 100 Mbps Ethernet connected to a 
3Com Superstack II 3900 switch. One of the machines 
was used as an NFS server from which directories 
were mounted to construct the virtual file system for 
the PeaPod on the other client system. The client ran 
Debian stable with a 2.4.21 kernel. 


Name            Description
--------------  ----------------------------------------------------
getpid          average getpid runtime
ioctl           average runtime for the FIONREAD ioctl
shmget-shmctl   IPC Shared memory segment holding an integer is
                created and removed
semget-semctl   IPC Semaphore variable is created and removed
fork-exit       process forks and waits for child that calls exit
                immediately
Apache          Runs Apache 1.3 under load and measures average
                request time
-               Linux Kernel 2.4.21 compile with up to 10 processes
                active at one time
Postmark        Use Postmark Benchmark to simulate Sendmail
                performance
-               “TPC-W like” interactions benchmark that uses
                Tomcat 4 and MySQL 4

Table 1: Application benchmarks.



To measure the cost of PeaPod’s virtualization 
layer, we used a range of micro benchmarks and real 
application workloads and measured their performance 
on our Linux PeaPod prototype and a vanilla Linux sys- 
tem. Table 1 shows the seven micro-benchmarks and 
four application benchmarks we used to quantify Pea- 
Pod’s virtualization overhead. To obtain accurate mea- 
surements, we rebooted the system between measure- 
ments. Additionally, the system call micro-benchmarks 
directly used the TSC register available on Pentium 
CPUs to record time-stamps at the significant measurement
events. Each time-stamp has an average cost
of 58 ns. The files for the benchmarks were stored on
the NFS Server. All of these benchmarks were per- 
formed in a chrooted environment on the NFS client 
machine running Debian Unstable. Figure 1 shows the
results of running the benchmarks under both configurations,
with the vanilla Linux configuration normalized
to one. Since all benchmarks measure the time to
run the benchmark, a smaller number is better for all
benchmark results.


Figure 1: PeaPod virtualization overhead.


The results in Figure 1 show that PeaPod’s virtu- 
alization overhead is small. PeaPod incurs less than 
10% overhead for most of the micro-benchmarks and 
less than 4% overhead for the application workloads. 
The overhead for the simple system call getpid bench- 
mark is only 7% compared to vanilla Linux, reflecting 
the fact that PeaPod virtualization for these kinds of 
system calls only requires an extra procedure call and 
a hash table lookup. 


The most expensive benchmark for PeaPod is
semget-semctl, which took 51% longer than vanilla
Linux. The cost reflects the fact that our untuned Pea- 
Pod prototype needs to allocate memory and do a 
number of namespace translations. The ioctl bench- 
mark also has high overhead, because of the 12 sepa- 
rate assignments it does to protect the call against ma- 
licious root processes. These assignments correspond 
to saving the four variables that store UID state, as- 
signing them a non privileged UID, and then restoring 
the original state. This is large compared to the simple 
FIONREAD ioctl that just performs a simple derefer- 
ence. However, since the ioctl is simple, we see that it 
only adds 200 ns of overhead over any ioctl. 
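To make the count of 12 assignments concrete, the following sketch mimics the save, demote, and restore sequence in Python. It is not the kernel code: the field names uid, euid, suid, and fsuid are our assumption about which four variables hold UID state, and the unprivileged UID value and function name are hypothetical.

```python
UID_FIELDS = ("uid", "euid", "suid", "fsuid")  # assumed UID state variables
UNPRIVILEGED_UID = 65534                       # hypothetical non privileged UID


def call_unprivileged(task, operation):
    """Run `operation` with all UID state temporarily demoted."""
    saved = {}
    for name in UID_FIELDS:        # 4 assignments: save the original state
        saved[name] = task[name]
    for name in UID_FIELDS:        # 4 assignments: drop to the unprivileged UID
        task[name] = UNPRIVILEGED_UID
    try:
        return operation(task)
    finally:
        for name in UID_FIELDS:    # 4 assignments: restore the original state
            task[name] = saved[name]


# A root process, modeled as a plain dictionary of UID fields.
task = {"uid": 0, "euid": 0, "suid": 0, "fsuid": 0}
seen_euid = call_unprivileged(task, lambda t: t["euid"])
```

The wrapped operation observes only the unprivileged UID, and the task's original root state is back in place once the call returns.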


For real applications, the highest overhead was only
four percent, which was for the Apache 1.3 workload,
where we used the http_load benchmark [21] to place a 
parallel fetch load on the server with 30 clients fetching 
at the same time. Similarly, we tested MySQL as part of 
a web-commerce scenario outlined by TPC-W with a 
bookstore servlet running on top of Tomcat 4 with a 
MySQL 4 back-end. The PeaPod overhead for this scenario
was less than 2% versus vanilla Linux. These
results are directly comparable to the virtualization results
in AutoPod [22] and are effectively the same,
demonstrating the additional overhead needed to con- 
fine processes into distinct peas is minimal. 


Related Work 


Many systems have been developed to isolate 
untrusted applications. NSA’s Security Enhanced Lin- 
ux [19], which is based upon the Flask Architecture 
[28], implements a policy language that one can use to 
implement models that enable one to enforce privilege 
separation. The policy language is very flexible but al- 
so very complex to use. The example security policy 
is over 80 pages long. There is research into creating 
tools to make policy analysis tractable [2], but the fact 
that the language is so complex makes it difficult for 
the average end user to construct an appropriate policy. 


System call interception has been used by sys- 
tems such as Janus [30, 10], Systrace [24], MAPbox 
[1], Software Wrappers [15], and Ostia [11]. These 
systems can enable flexible access controls per system 
call, but they have been limited by the difficulty of 
creating appropriate policy configurations. TRON [5], 
SubDomain [7] and Alcatraz [17] also operate at the 
system call level but focus on limiting access to the 
underlying file system. TRON allows transitions be- 
tween different isolation units but requires application 
modifications to use this feature, while SubDomain 
supports an implicit transition on execution of a new 
child process. These systems provide a model some- 
what similar to the file system approach used by Pea- 
Pod peas. However, peas are designed based on a full- 
fledged stackable file system that integrates fully with 
regular kernel security infrastructure and provides 
much better performance. Similarly, PeaPod’s virtualization 
layer provides a complete process isolation solution 
that is not limited to file system protection. 


Safer languages and run-time environments, most 
notably Java, have been developed to prevent common 
software errors and isolate applications in language- 
based virtual machine environments. These solutions 
require applications to be rewritten or recompiled, of- 
ten with some loss in performance. Other language- 
based tools [8, 3] have also been developed to harden 
applications against common attacks, such as buffer 
overflow attacks. PeaPod’s virtualization layer com- 
plements these approaches by providing isolation of 
legacy applications without modification. 


Virtual machine monitors (VMMs) have been 
used to provide secure isolation [29, 32, 4]. Unlike 
PeaPod’s virtualization layer, VMMs decouple pro- 
cesses from the underlying machine hardware, but tie 
them to an instance of an operating system. As a re- 
sult, VMMs provide an entire operating system in- 
stance and namespace for each VM and lack the abili- 
ty to isolate components within an operating system. If 
a single process in a VM is exploitable, malicious 
code can use it to access the 
entire set of operating system resources. Since PeaPod’s 
virtualization layer decouples processes from 
the underlying operating system and its resulting 
namespace, it is natively able to limit the separate 
processes of a larger system to the appropriate resources 
needed by them. Furthermore, VMMs impose 
more administrative overhead, because multiple full 
operating system instances must be administered, and 
higher memory overhead, because of the requirements 
of the underlying operating system. 


A number of other approaches have explored the 
idea of virtualizing the operating system environment 
to provide application isolation. FreeBSD’s Jail mode 
[14] provides a chroot-like environment that processes 
cannot break out of. However, Jail is limited in what it 
can do; for example, it does not allow IPC within 
a jail [9], and therefore many real-world applications 
will not work. More recently, Linux Vserver [18] and 
Solaris Zones [23] offer a virtual machine abstraction 
similar to PeaPod pods, but require substantial in-kernel 
modifications to support the abstraction. While 
these systems share the simplicity of the pod abstraction, 
they do not provide the finer-granularity isolation 
provided by peas. 


Conclusions 


The PeaPod system provides an operating system 
virtualization layer that enables secure isolation of 
legacy applications. The virtualization layer supports 
two key abstractions for encapsulating processes, pods 
and peas. Pods provide an easy-to-use lightweight vir- 
tual machine abstraction that can securely isolate indi- 
vidual applications without the need to run an operat- 
ing system instance in the pod. Peas provide a fine- 
grain least privilege mechanism that can further isolate 
application components within pods. PeaPod’s virtual- 
ization layer can isolate untrusted applications, pre- 
venting them from being used to attack the underlying 
host system or other applications even if they are com- 
promised. 


PeaPod’s secure isolation functionality is achieved 
without any changes to applications or operating sys- 
tem kernels. We have implemented PeaPod in a Linux 
prototype and demonstrated how peas and pods can be 
used to improve computer security and application 
availability for a range of applications, including e- 
mail delivery, web servers and databases, and desktop 
computing. Our results show that PeaPod’s virtualiza- 
tion layer can provide easily configurable and secure 
environments that can run a wide range of desktop and 
server Linux applications in least privilege environ- 
ments with low overhead. 
ABSTRACT 


Contemporary storage systems separate the management of data from the management of the 
underlying physical storage media used to store that data. This separation is artificial and increases 
the overall burden of managing such systems. We propose a new management layer that unifies 
data and storage management without any loss of control over either the data or the storage. 


The new management layer consists of three basic entities: data sets, which describe the data 
managed by the system, policies that specify rules for the management of data sets, and resource 
pools that represent storage that can be used to implement the policies for the data sets. Using data 
sets and policies, data administrators are able to directly manage their data, even though that re- 
quires them to indirectly manage the underlying storage infrastructure. The storage administrator 
retains control over what the data administrator can do to the storage. 


This new management layer provides the infrastructure necessary to build mechanisms that 
automate or simplify many administrative tasks, including the necessary coordination between the 
data and storage administrators. We describe Protection Manager, the first tool that uses this man- 
agement layer to present storage management in the context of data management tasks that the 
data administrator wants to perform, and evaluate the effectiveness of using this management layer 
as a way to automate backups. 


Introduction 


Data management, in the form of the manage- 
ment of files, file systems and structured data, has tra- 
ditionally been a separate discipline from the manage- 
ment of underlying storage infrastructure. Data admin- 
istrators have historically concerned themselves with 
the redundancy, performance, persistence and avail- 
ability of their data. The storage administrator has fo- 
cused on delivering physical infrastructure that satis- 
fies the data’s requirements. Typically, the storage is 
configured and then, within the constraints of the con- 
figured storage, data management takes place. If the 
storage requirements of the data change, the data must 
be migrated to different storage or the underlying stor- 
age must be reconfigured. Either process is disruptive 
and requires multiple domain-specific administrators 
to work closely together. 


The separation of administrators is a natural out- 
come of the incompatible and inflexible infrastruc- 
tures deployed. Consider a small aspect of data man- 
agement: redundancy. Owners of data need a certain 
amount of redundancy for two reasons. The first is to 
tolerate faults in the physical media, such as disk fail- 
ures or storage array failures. The second is to tolerate 
human and software errors that occur during the ma- 
nipulation of the data. The traditional solution has 
been to use a variety of incompatible storage plat- 
forms from storage vendors to provide differing levels 
of hardware reliability and to use tape for recovering 
from human and software errors. In effect, the under- 
lying infrastructure acts as a “proxy” for managing 
the data. 


Modern virtualization features, such as space- 
efficient replication, result in a storage infrastructure 
that eliminates much of the incompatibility and inflex- 
ibility found in traditional storage environments, but 
data management and storage management remain 
separate disciplines. For example, an administrator for 
a storage system manages FlexVol [3] volumes and 
aggregates for provisioning storage and uses SnapMir- 
ror [10] and SnapVault [4] for making copies of the 
FlexVol volumes. A data administrator, on the other 
hand, manages files or structured data and thinks in 
terms of copies of their files or structured data. Map- 
ping data management requirements onto storage man- 
agement primitives still requires human interactions 
and complex processes. As a result, even with a flexible 
and homogeneous storage infrastructure, data management 
is still done as if the infrastructure were inflexible 
and incompatible. 


To address the gap between what a storage ad- 
ministrator manages and what a data administrator 
wants to manage, we introduce a new data manage- 
ment framework that we believe can become the basis 
for a new unified storage and data management disci- 
pline. The data management framework consists of a 
set of three new management objects: data sets, re- 
source pools, and policies. It also consists of several 
infrastructure components: role-based access control 
[5], a conformance engine, and a centralized database 
that holds the definitions of these objects. 


The data set represents a collection of data and 
all of the replicas of the data. A data set is independent 
of any particular physical infrastructure that is current- 
ly being used to store its data and is the basic entity 
that a data administrator manages. A resource pool is a 
collection of physical storage resources managed by 
storage administrators. A policy describes the desired 
behavior of the data. Role-based access control (RBAC) 
is used to control access to all managed entities includ- 
ing data sets, policies and resource pools. The confor- 
mance engine is responsible for monitoring and config- 
uring the storage infrastructure such that a data set is al- 
ways in conformance with its policy. 
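
The relationship between the three management objects can be illustrated with a minimal sketch. It assumes Python dataclasses, and every class and field name below is a hypothetical illustration rather than the actual Protection Manager schema. The point it shows is that a data set keeps its identity and policy even when the storage containers behind it change.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResourcePool:
    """Physical storage managed by a storage administrator (hypothetical model)."""
    name: str
    members: list = field(default_factory=list)  # e.g. aggregates or storage systems


@dataclass
class Policy:
    """Desired behavior of the data, expressed as named attributes."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class DataSet:
    """A collection of data plus its replicas, independent of where it is stored."""
    name: str
    policy: Optional[Policy] = None
    containers: list = field(default_factory=list)  # current storage containers

    def relocate(self, old: str, new: str) -> None:
        # Only the container list changes; name and policy are untouched.
        self.containers = [new if c == old else c for c in self.containers]


gold = Policy("gold", {"redundancy": "mirror"})
home_dirs = DataSet("home-directories", policy=gold, containers=["vol_a"])
home_dirs.relocate("vol_a", "vol_b")
```

After the call to `relocate`, the data set still refers to the same policy object, which mirrors the claim that a data set is independent of the physical infrastructure currently storing it.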


Using this machinery, a storage administrator 
controls how storage is used by defining specific poli- 
cies and controls the rights to use these policies with 
RBAC. Once a storage administrator has configured 
the policies and access control, data administrators can 
create or administer their data sets by assigning 
policies from the appropriate authorized set. Because 
changes in policies allow reconfiguration of the storage, 
and this reconfiguration is done automatically using the 
conformance engine, a substantial increase in efficiency 
results when compared with systems in which storage 
management and data management are separated. 


For example, in the case of redundancy, the stor- 
age administrator first configures resource pools with 
varying kinds of physical redundancy, such as RAID 
levels or failover storage capabilities. The storage ad- 
ministrator then constructs policies that can be used to 
create data sets with varying degrees of physical and 
backup redundancy. When a data administrator re- 
quires storage for a data set, he or she selects a policy 
that provides the appropriate levels of redundancy. 
The conformance engine will then provision the actual 
storage required on the appropriate systems and repli- 
cate the data. The conformance engine also monitors 
the storage to handle scenarios in which the data set is 
no longer in conformance with the selected policy. 


The first implementation of this framework, called 
Protection Manager, addresses the particular problem of 
data replica management. The details of this implemen- 
tation are provided in later sections of this paper. 


In the rest of this paper we describe the architec- 
ture of the data management framework and how the 
concepts of data sets, policies and resource pools com- 
bine to solve data protection and provisioning prob- 
lems. We then describe how the machinery was imple- 
mented in the Protection Manager tool. We evaluate 
the extent to which the concepts and framework em- 
power data administrators and simplify administrative 
tasks as policies change. Finally, we describe future 
work and make comparisons to other systems. 


Related Work 


We found a variety of research and systems that 
targeted aspects of our data management framework; 
however, none propose or describe an architecture that 
unifies storage and data management. In particular, none 
described how storage administrators could securely del- 
egate storage management to data administrators such 
that global policies are enforced and maintained. Although 
some research alludes to the need for a common 
language between the storage and data administrators, we 
are the first to propose a usable model. 


The notion of separating a logical view of data 
from a physical view in storage is not a new concept. 
Work done at IBM by Gelb describes a similar model 
where one of the primary goals was to isolate physical 
device characteristics from application awareness in homogeneous 
IBM environments [6]. The IBM work recognized 
the need for automation as a key to managing 
more data effectively, but appears to concentrate on ini- 
tial provisioning operations, rather than automating any 
reconfigurations of storage due to policy changes. 


Keaton, et al. propose a model for automating 
data dependability that is very similar to our data management 
framework [8]. However, their focus is on the 
mechanics of automating policy design, not on how 
their complete model could in fact be architected. 
Their goal is to create a mechanism that would 
allow a storage administrator to automatically select 
the policy given the business goals and the underlying 
technology. Such a mechanism could be incorporated 
in our data management framework, automating the 
manual process of policy design. 


The problem of automating policy design in gen- 
eral has been studied [1, 2, 9]. However, none of these 
papers present a general architecture for how such a 
policy can be used by administrators or software to 
implement the policy that is designed. The approaches 
presented could be incorporated into our framework to 
simplify the design of policies. 

The use of policies and storage pools in backup 
applications is not new. Kaczmarski describes in TSM 
the use of storage pools and policies [7]. However, the 
intent of the policies is not to provide a mechanism to 
allow data administrators to manage their own data or 
to manage the underlying storage, but to provide TSM 
with more information as it makes data management 
decisions. 


Policy-based management systems are also noth- 
ing new. Commercial systems such as Brocade Stor- 
ageX, Opsware, and Symantec all utilize this approach 
for many management tasks. However, these policies 
are intended to automate tasks within an administration 
domain, not as a mechanism to delegate control between 
domains. Patterson, et al. describe the use of policies to 
reduce the cost of storage management by compressing 
or deleting old data in the SRM space [11]. 


Data Set Concepts 


Prior to the introduction of data sets and policies, 
administrators drew little distinction between data and 
the physical containers associated with that data. As a 
result, administrators thought of data in terms of 
where it was located and how it was configured. 
As data management requirements changed, 
storage administrators had to manually map the data to 
the appropriate storage container and change the con- 
figuration associated with that storage container to 
match those new data management requirements. Managing 
the mapping and ensuring that the configuration was 
correct were time-consuming and prone to human error. 
As a result, configuration changes were infrequent. The 
net result is that storage was either over-provisioned or 
under-provisioned in terms of capacity, performance, or 
redundancy. 


Data sets and resource pools do not replace the ex- 
isting storage and data management containers; they are 
the management view of those containers. For example, 
a “data set” refers to the data stored by a FlexVol volume. 
When storage is provisioned, a FlexVol volume is 
actually created on the storage system; however, because 
the data set’s configuration is stored externally to the storage 
system, a data set can exist for the entire life cycle 
of the data, whereas a particular storage container might 
not. The data might move to another volume or begin 
sharing the volume with data in another data set. We ex- 
pect that, over time, administrators will tend to refer to 
data sets instead of the underlying storage containers. 


Resource Pools 


A resource pool contains storage from which 
data sets can be provisioned. A resource pool can be 
constructed from disks, volumes, aggregates, or 
entire storage systems. If a resource pool is construct- 
ed using a storage system then we implicitly add all of 
the disks and pre-created aggregates on that storage 
system, including any additional aggregates or disks 
that are later added. There might be many storage sys- 
tems or aggregates from multiple storage systems in a 
single resource pool. The resource pool definition is 
stored in the external database. 


In addition to storage capacity, a resource pool 
also contains the attributes and capabilities of the un- 
derlying storage systems in the pool. These attributes 
include the data access protocols supported, the per- 
formance of the software and storage controller, the 
reliability of the controller, and the data management 
software features that are available. These properties 
are automatically discovered and recorded when stor- 
age or storage systems are added to resource pools. 


The conformance engine uses the capacity and 
attributes of resource pools when determining the best 
location for data to be provisioned. A single resource 
pool might contain different tiers of storage represent- 
ing different performance and cost attributes. 


A resource pool serves two purposes. The first is 
to reduce the total number of distinct objects a storage 
administrator must manage, and the second is to allow 
greater opportunities for space and load optimizations. 


Policies such as thin provisioning limits and 
RBAC can be applied to whole resource pools, rather 
than to individual storage systems or components. 
Since a resource pool consists of discrete quantities 
of storage, the larger the pool, the more opportunities 
to optimize for space and load balancing exist 
within that resource pool. 

User-Defined Properties 

Although many properties of resource pool mem- 
bers are discovered automatically, certain properties 
might be explicitly defined by administrators. This 
permits more flexibility and control when matching 
provisioning requests with available resources. For in- 
stance, it might be desirable for administrators to add 
an explicit property related to geography to a resource 
pool member. This property then might be specified as 
part of a provisioning policy and matched against 
available resources in a resource pool that has been as- 
signed this property. 
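
Matching a provisioning request against resource pool members can be sketched as a simple property comparison. This is only an illustration under stated assumptions: the function name `matching_members`, the dictionary layout, and the property names (`protocol`, `site`) are hypothetical, and the real matching logic is not described in this paper.

```python
def matching_members(pool, required):
    """Return names of pool members whose properties satisfy every requirement."""
    return [
        name
        for name, props in pool.items()
        if all(props.get(key) == value for key, value in required.items())
    ]


# "protocol" stands in for an automatically discovered property;
# "site" stands in for a property explicitly defined by an administrator.
pool = {
    "aggr1": {"protocol": "nfs", "site": "london"},
    "aggr2": {"protocol": "nfs", "site": "boston"},
    "aggr3": {"protocol": "cifs", "site": "boston"},
}

candidates = matching_members(pool, {"protocol": "nfs", "site": "boston"})
```

A request with no requirements matches every member, so adding a user-defined property such as geography only ever narrows the set of candidates.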

Data Sets 


A data set represents a collection of data and its 
replicas. In our current implementation, “data set” 
refers to the data contained within one or more storage 
containers, not the storage containers. The storage 
containers used by a data set might change, over time, 
due to load- or space-balancing actions or policy 
changes. These changes should be transparent to the 
users of the data. A data set is provisioned from exist- 
ing storage containers or from a resource pool accord- 
ing to policy. The data set definition is stored in the 
external database. 


Data sets also have provisioning and data protec- 
tion policies. The policies apply to all of the data in 
the data set. 


A data set serves four purposes. The first is to al- 
low data administrators to manage their data without 
requiring that they understand how the storage is con- 
figured or where it is physically located. Once the data 
set has been defined, the administrators only have to 
choose the policies that best match their data manage- 
ment requirements. 


The second purpose of a data set is to reduce the 
number of managed objects. A data administrator 
might have a lot of data that must be monitored and 
managed as a single unit spread across many distinct 
storage containers, such as Oracle databases or Exchange 
applications. A data set allows both the storage 
and data administrators to manage and view the data 
as a single unit. 


The third purpose is that a data set provides a 
convenient handle to all of the replicas, allowing ad- 
ministrators to view or restore versions of their data 
without requiring knowledge of where those versions 
are (or were) stored. 


The fourth is that the data set provides the map- 
ping between the physical storage and the desired be- 
havior associated with its policies. As new storage capa- 
bilities are added to the system, or policies are changed, 
the data management framework can reconfigure the ex- 
isting storage containers, or possibly migrate data to 
new storage containers, to better satisfy the data set policy 
requirements. 


Policies 


A policy describes how a data set should be con- 
figured. This configuration specifies both how the data 
set should be provisioned and how it should be pro- 
tected. In our framework, we treat these as distinct 
policies. 

A provisioning policy consists of a set of at- 
tributes that the data set requires from a particular re- 
source pool. Specific attributes include — but are not 
limited to — cost, performance, availability, how the 
data can be accessed, and what to do in out-of-space 
situations. 


A data protection policy is a graph in which the 
nodes represent how storage must be configured on re- 
source pools and the edges describe how the data must 
be replicated between resource pools. The nodes have 
attributes such as the retention periods for Snapshot 
copies. The edges have attributes such as lag threshold 
or whether mirroring or full backup replication is re- 
quired. 
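
The graph structure of a data protection policy can be made concrete with a small sketch. The class names, field names, and the numeric values (retention days, lag hours) are all hypothetical and chosen only for illustration. The sketch shows nodes carrying retention attributes and edges carrying the replication type and lag threshold.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyNode:
    name: str
    snapshot_retention_days: int  # node attribute: how long Snapshot copies are kept


@dataclass
class PolicyEdge:
    source: str
    destination: str
    kind: str                  # "mirror" or "backup"
    lag_threshold_hours: int   # edge attribute: maximum tolerated replica lag


@dataclass
class ProtectionPolicy:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.name] = node

    def connect(self, edge):
        # An edge may only join nodes that are already part of the policy.
        if edge.source not in self.nodes or edge.destination not in self.nodes:
            raise ValueError("edge refers to an unknown node")
        self.edges.append(edge)

    def outgoing(self, node_name):
        return [e for e in self.edges if e.source == node_name]


policy = ProtectionPolicy()
policy.add_node(PolicyNode("primary", snapshot_retention_days=7))
policy.add_node(PolicyNode("backup", snapshot_retention_days=90))
policy.add_node(PolicyNode("dr-mirror", snapshot_retention_days=90))
policy.connect(PolicyEdge("primary", "backup", "backup", lag_threshold_hours=24))
policy.connect(PolicyEdge("backup", "dr-mirror", "mirror", lag_threshold_hours=4))
```

Here the primary node is backed up to a secondary node, which is in turn mirrored to a third node; `outgoing` returns the edges that a conformance check would need to examine for a given node.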


Policies are used by storage administrators to de- 
scribe how storage should be provisioned and protect- 
ed. These policies are then used any time data sets are 
provisioned, eliminating configuration errors that re- 
sult from ad-hoc processes. These policies can also be 
given access control by the storage administrator, such 
that not all resources and configurations are available 
to all data administrators. For instance, some data ad- 
ministrators might not have access to policies that re- 
quire highly reliable or high-performance storage (be- 
cause of the expense required to satisfy those poli- 
cies). 

Data administrators are free to select any autho- 
rized provisioning and protection policies that meet 
their desired data behavior without regard to how the 
storage is configured or located. 


In practice, the data administrator might assign 
provisioning or protection policies to data sets current- 
ly stored in containers which are incompatible with 
the policy. 

For example, if a data set includes data which resides 
on a storage system without a SnapMirror license, 
the data set cannot be brought into conformance with a policy 
specifying a mirror relationship between the primary and 
secondary nodes without reconfiguration. Similarly, if a data 
set has a policy attached, data administrators might 
add members to the primary node of the data set 
which are incompatible with the policy. In both cases, 
the conformance engine can detect the conflict and ex- 
plain to the data administrator why the underlying 
storage needs to be reconfigured or the data migrated. 
The administrator can cancel the operation or approve 
the tasks proposed by the conformance engine to bring 
the data set into conformance. 
Conformance Engine 


The conformance engine uses the policy to con- 
figure the underlying storage. The conformance engine 
service ensures that the resources used by the data set 
conform to the attributes described in the associated 
policy. Much of the automation associated with our 
management framework is derived from this construct. 
The conformance engine first monitors the physical en- 
vironment and then compares the physical environment 
to the desired configuration specified by the policies. If 
there is a deviation from configured policy, the software 
alerts administrators to the violation and, in some cases, 
automatically corrects the problem. 


The conformance engine uses the management ob- 
ject definitions stored in the database when it checks for 
policy deviations. 


The conformance engine is split into two parts. 
The first part performs a comparison between the policy 
associated with the data set and the physical environment, 
and then prepares a list of proposed actions 
to bring the data set into conformance. The second 
part executes the resulting actions. 


By breaking this module into two parts, we allow 
the software to “dry run” any policy changes or re- 
quests in order to observe and confirm the resulting 
actions. This is an important feature in cases in which 
policy changes might result in highly disruptive ac- 
tions, such as reestablishing a baseline replica on dif- 
ferent storage, or in cases in which administrators sim- 
ply want to review any changes before committing to 
them. 
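
The split between planning and execution, and the “dry run” it enables, can be sketched as follows. This is a simplified illustration, not the Protection Manager implementation: the function names and the representation of configuration as flat dictionaries are assumptions made for the example.

```python
def plan(desired, actual):
    """Part one: compare policy with the environment and propose actions."""
    tasks = []
    for setting, wanted in desired.items():
        if actual.get(setting) != wanted:
            tasks.append((setting, wanted))
    return tasks


def execute(tasks, actual):
    """Part two: apply previously proposed actions to the environment."""
    for setting, wanted in tasks:
        actual[setting] = wanted


def conform(desired, actual, dry_run=True):
    # A dry run returns the proposed tasks without changing anything.
    tasks = plan(desired, actual)
    if not dry_run:
        execute(tasks, actual)
    return tasks


desired = {"raid_level": "raid6", "mirror": "enabled"}
actual = {"raid_level": "raid6", "mirror": "disabled"}

preview = conform(desired, actual, dry_run=True)    # nothing is changed
applied = conform(desired, actual, dry_run=False)   # changes are committed
```

Because `plan` never modifies its inputs, an administrator can review the proposed task list and only then run the same request with `dry_run=False`.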


For example, the conformance engine periodical- 
ly compares the data protection relationships protect- 
ing the members of a data set with the policy associat- 
ed with the data set. For each node in the data set, the 
conformance engine determines the type of data pro- 
tection relationship, or relationships, which policy dic- 
tates this node should be the source of. Then, for each 
member of the node, the conformance engine attempts 
to find a data protection relationship of this type originating 
with the member and terminating with a member 
of the destination node for each of the node’s outgoing 
connections. If a relationship is not found, a task 
is generated to construct one. 
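
The check described above can be sketched in runnable form. This is a simplified illustration under stated assumptions: the function name, the tuple-based data layout, and the member names are hypothetical, and the sketch treats all members uniformly rather than distinguishing qtrees (for backups) from volumes (for mirrors).

```python
def conformance_tasks(policy_edges, node_members, relationships):
    """Return (member, kind, destination_node) tasks for missing relationships.

    policy_edges:  list of (source_node, kind, destination_node)
    node_members:  dict mapping node name -> list of member names
    relationships: set of (source_member, kind, destination_member) that exist
    """
    tasks = []
    for source_node, kind, dest_node in policy_edges:
        dest_members = node_members.get(dest_node, [])
        for member in node_members.get(source_node, []):
            # A member conforms if some relationship of the right kind
            # ends at any member of the destination node.
            protected = any(
                (member, kind, dest) in relationships for dest in dest_members
            )
            if not protected:
                tasks.append((member, kind, dest_node))
    return tasks


edges = [("primary", "backup", "secondary")]
members = {"primary": ["qtree1", "qtree2"], "secondary": ["sec_qtree1"]}
existing = {("qtree1", "backup", "sec_qtree1")}

tasks = conformance_tasks(edges, members, existing)
```

In this example only `qtree2` lacks a backup relationship, so a single task is generated for it; an empty task list means the data set is in conformance.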


The results of the conformance run include the 
task list and an explanation of what actions the system 
will perform to bring the data set into conformance 
with the configured policy. For most actions, these 
tasks are executed automatically, but some require us- 
er confirmation. Over time, the set of tasks administra- 
tors are willing to automate without user intervention 
will grow. It is also possible that “unresolvable” tasks 
will be generated which require user intervention be- 
fore they can be executed. 


Role Based Access Control (RBAC) 


RBAC controls management access to all of the 
management objects. An RBAC service allows a 
security administrator to configure which role can perform 
which operations on which objects. Whenever 
any operation is attempted on any management object, 
the RBAC service is consulted for authorization. The 
RBAC configuration is maintained in the same data- 
base as the data sets, policies and resource pools. 


The RBAC system allows storage architects to 
delegate responsibility to data administrators. RBAC 
allows a storage administrator to safely allow data ad- 
ministrators to select policies and resource pools for 
their data sets without relinquishing control over the 
particular resources used. 
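
The authorization check can be sketched as a lookup over (role, operation, object) triples. The class name, role names, and object identifiers below are hypothetical, and the deny-by-default behavior is an assumption of this sketch rather than something stated in the paper.

```python
class AccessControl:
    """Maps (role, operation, object) triples to an allow decision."""

    def __init__(self):
        self._grants = set()

    def grant(self, role, operation, obj):
        self._grants.add((role, operation, obj))

    def is_authorized(self, role, operation, obj):
        # Assumption: anything not explicitly granted is denied.
        return (role, operation, obj) in self._grants


rbac = AccessControl()
rbac.grant("data-admin", "assign", "policy:silver")
rbac.grant("storage-admin", "assign", "policy:gold")
rbac.grant("storage-admin", "modify", "policy:gold")

may_use_silver = rbac.is_authorized("data-admin", "assign", "policy:silver")
may_use_gold = rbac.is_authorized("data-admin", "assign", "policy:gold")
```

In this example the data administrator may assign the less expensive policy but not the one reserved for the storage administrator, which is the kind of delegation the text describes.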


Once storage resources are in use by a data set or 
assigned to a resource pool, data administrators, who 
are not permitted to manage storage directly, are able 
to indirectly manage them by performing operations 
on the data set or resource pool. 


taskList GetConformanceTasks(dataSet ds) { 
    taskList tl; 
    policy p = ds->getPolicy(); 

    foreach (node n in p->getNodes()) { 
        foreach (connection c in n->getOutgoingConnections()) { 
            destNode dn = c->getDestNode(); 

            if (c->isBackupRelationship()) { 
                // backups are performed on qtrees 
                foreach (qtree q in n->getQtrees()) { 
                    bool conformant = FALSE; 
                    foreach (destMember dm in q->getDestMembers()) { 
                        if (dn->hasMember(dm)) { 
                            conformant = TRUE; 
                            break; 
                        } 
                    } 
                    if (!conformant) { 
                        // generate a task to create a backup relationship 
                        tl->addTask(q, c, dn); 
                    } 
                } 
            } 

            if (c->isMirrorRelationship()) { 
                // volumes are mirrored 
                foreach (volume v in n->getVolumes()) { 
                    bool conformant = FALSE; 
                    foreach (destMember dm in v->getDestMembers()) { 
                        if (dn->hasMember(dm)) { 
                            conformant = TRUE; 
                            break; 
                        } 
                    } 
                    if (!conformant) { 
                        // generate a task to create a mirror relationship 
                        tl->addTask(v, c, dn); 
                    } 
                } 
            } 
        } 
    } 
    return tl; 
} 


Figure 1: Algorithm used to compare data protection relationships against a configured data protection policy and 
generate conformance tasks. If no tasks are returned, the data set is in conformance. 


Comparing the Traditional and Data Set Views 


Consider a data center with users that have home 
directories that are accessed with NFS and CIFS, Oracle 
deployments over NFS, a Microsoft Exchange deployment, 
and varying degrees of data protection. 


In Figure 2 we show the traditional view. In Figure 3 
we show the data set view. The traditional view 
presents a detailed schematic of how the infrastructure 
is currently configured. It obscures the fact that there 
are several data sets, as well as the data protection and 
provisioning policies associated with those data sets. 
Furthermore, the schematic presents a lot of 
information that is not always important and also 
hides the fact that some of the differences are irrelevant. 
The data set view, on the other hand, makes it 
very clear which data is using the infrastructure and 
what the provisioning and data protection policies are. 
The data set view also aggregates the physical infrastructure. 
Rather than seeing a schematic layout with 
specific components, we see that there are three differ- 
ent tiers of storage that can be used for protection and 
provisioning. The data set view is easier to use when 
performing daily management. The schematic view is 
important when you need to actually drill down to and 
manipulate the physical components directly, such as 
when the hardware breaks or needs to be changed. 


Implementation 
Protection Manager 


The first system to incorporate the framework is 
Protection Manager, from Network Appliance. It al- 
lows backup administrators and storage architects to 
coordinate their efforts to construct end-to-end data 
protection solutions using a workflow-based graphical 
user interface. In addition to integrating several Net- 
work Appliance data protection technologies, Protec- 
tion Manager uses higher-level abstractions to lever- 
age administrators’ time. 


Protection Manager builds on the functionality of 
DataFabric Manager by extending the client/server ar- 
chitecture. Figure 4 shows how Protection Manager 
and various components that are used to protect stor- 
age relate. The dashed lines represent open network 
APIs. 


Storage architects use Protection Manager to define
data protection policies and define resource pools
consisting of aggregates available for secondary storage
provisioning. Then, backup administrators define data
sets by adding primary storage objects, such as volumes
and qtrees. Based on the data protection policy assigned
to the data set (either by the storage architect or by
the backup administrator), Protection Manager provisions
the required secondary storage from the configured
resource pools and creates backup and mirror
relationships.


Figure 2: The storage infrastructure view presents a detailed schematic view that obscures how the data uses the
infrastructure and omits the data’s requirements.

Figure 3: The data set view shows what data is using the infrastructure and what requirements are associated with
the data. The data set view abstracts out the details of the underlying storage infrastructure.

Figure 4: High-level architecture of Protection Manager.


Once the data set is in conformance with its con- 
figured data protection policy, Protection Manager 
continues to monitor its status for deviation and takes 
action to bring the data set back into conformance. Be- 
sides infrastructure errors, out-of-space conditions and 
policy changes, Protection Manager also monitors 
changes in the primary data, such as the creation of a 
new FlexVol volume.


As an example, if the backup administrator cre- 
ates a data set, “Branch Office,” and assigns the policy, 
“Back up, then mirror,” Protection Manager knows 
that the administrator wants the primary data added to 
the data set to be protected with a local Snapshot sched- 
ule, backed up to secondary storage, and then mirrored 
to tertiary storage. By assigning resource pools for sec- 
ondary and tertiary storage provisioning, the admini- 
strator tells Protection Manager from where to provi- 
sion storage. Protection Manager will use the policy’s 
configured schedules for local Snapshot, SnapVault
between primary and secondary, and SnapMirror updates
between the secondary and tertiary to protect the prima- 
ry data. 


If a new volume is created in a storage system 
and added to the data set, Protection Manager will dis- 
cover the new volume and declare that the data set is 
out of conformance with its configured data protection 
policy. To rectify the situation, Protection Manager 
will begin making Snapshot copies of the new volume, 
provisioning a secondary volume, creating a Snap- 
Vault relationship between the new volume and the 
provisioned secondary volume, provisioning a tertiary 
volume, and creating a SnapMirror relationship
between the secondary and tertiary volumes. Only then
is the data set considered in conformance once again. 
None of these tasks requires any user interaction with 
the system.


Geography

Resource pools can be used to group storage re- 
sources by performance, availability, owner or any 
combination thereof, but one of the most interesting 
uses of resource pools is for geographic grouping. The 
data administrator knows the geographic location of 
the primary data in his data set and the configuration 
of resource pools by the storage administrators gives 
him visibility into the geographic location of potential 
backup and mirror sites. This gives him the ability to 
assign a data protection policy to his data set and se- 
lect remote storage resources to implement the policy. 


Availability of Services 


Although the definitions of data sets, resource 
pools, and policies are maintained in a database within 
DataFabric Manager in our current implementation, 
the enforcement mechanism of these policies is con- 
figured on storage systems when possible. For in- 
stance, the scheduling policy for replication resides in 
the policy definitions, but the actual scheduling of 
replication updates should be configured on the asso- 
ciated storage systems. This approach minimizes dis- 
ruptions in service should a centralized server, such as 
DataFabric Manager, fail. 


The DataFabric Manager server itself can be 
configured to run in a high availability configuration 
such as Microsoft Cluster Server. 


Impact 


In this section, we first evaluate to what extent 
Protection Manager, using our data management frame- 
work, enables division of labor between administrators, 
empowers data administrators to configure storage and 
to what extent storage is automatically reconfigured 
when data protection policy changes. We then explore 
whether our data management framework, as imple- 
mented in Protection Manager, actually simplifies ad- 
ministration. 

Division of Labor Between Administrators 


Storage administrators and data administrators 
have different organizational roles. In order to suc- 
cessfully configure data protection, each administrator 
must coordinate his efforts with the other. Only the 
data administrator can identify the subset of stored 
data which constitutes a particular data set and only 
the storage administrator can identify pools of re- 
sources for the data administrator to use. 


In a traditional environment, the data administra- 
tor must construct a request describing the storage 
containers he wants protected and either the degree of 
protection required or enough information about the 
use of the data set so that the storage administrator can
determine the degree of protection required. This com- 
munication could take the form of a help desk ticket or 
an email. Either way, it must be processed manually and 
the potential for human error is high. Furthermore, the cost
of the interaction discourages frequent changes to re- 
quirements and encourages “over provisioning” where- 
by data administrators request more resources than they 
need to avoid additional requests later. 


With the Protection Manager tool, the storage ad- 
ministrator defines resource pools and data protection 
policies and delegates access to them to the data ad- 
ministrator. Later, the data administrator defines a data 
set and assigns the appropriate data protection policy. 
The data administrator only has to know which policy 
is required for the class of data in his data set. In the 
definition of the policy, the storage administrator has 
already encapsulated what that means in terms of 
schedules, backup and mirror relationships and reten- 
tion times. 


By eliminating the need to coordinate the efforts 
of two administrators every time a data set is protect- 
ed, the Protection Manager tool simplifies the data 
protection process. 


Empowering Data Administrators to Configure 
Storage 


In this section we evaluate to what extent the 
data management framework implemented in Protec- 
tion Manager actually empowers the data administra- 
tor to configure storage to match data protection re- 
quirements. To perform the evaluation, we focus on 
the following task: given a data set, which steps does 
the data administrator have to execute to configure the 
data set with local Snapshot copies, remote copies of a 
subset of the Snapshot copies, and mirrors of the remote
copies of the Snapshot copies.


Before we describe what the data administrator 
does using Protection Manager, consider how the stor- 
age is traditionally configured. The primary volume 
must be configured with a Snapshot schedule. On the 
secondary system, a volume must be created along 
with an appropriate SnapVault schedule. Finally, on 
the disaster recovery storage system, a volume must 
be created along with a SnapMirror configuration that 
copies the data from the secondary volume. To per- 
form these tasks without Protection Manager, the data 
administrator must have administrative access to the 
storage systems and understand the details of how to 
correctly configure each element. 


Using the Protection Manager UI pictured in Fig- 
ure 5, the data administrator first selects a policy that has 
been created by the storage administrator. The data ad- 
ministrator must then select resource pools that have 
been created by the storage administrator for the various 
nodes in the policy. At this point the conformance engine 
will create the secondary and disaster recovery volumes, 
configure the Snapshot schedule, set up the SnapVault 
relationship and configure the SnapMirror relationship
using the APIs available on the storage system. 
Figure 5: Protection Manager UI.


We have thus shown how, using Protection Manager
and our data management framework, a data
administrator can configure the underlying storage to
meet his or her data protection requirements.


Automatic Reconfiguration of the Storage 


In this section we evaluate to what extent Protec- 
tion Manager allows a storage administrator to reconfig- 
ure storage automatically as a result of a change of poli- 
cy. We assume that the storage administrator has two 
data protection policies in his or her environment. The 
first data protection policy only has a local Snapshot 
schedule. The second data protection policy has a local
Snapshot schedule and a remote backup schedule using
SnapVault. To perform the evaluation we focus on
the following task: changing the remote backup schedule 
for the second policy. 


Before we describe the steps using the Protection 
Manager UI, we describe how the storage must be re- 
configured. All of the secondary volumes that have 
SnapVault relationships must have their schedules 
changed to conform to the new schedule. To perform 
this task, the storage administrator must log into every 
secondary and modify each schedule. 


Using the Protection Manager UI, the storage
administrator selects the policy to be modified and
then modifies any attributes, for example,
how frequently Snapshot copies should be
made. The storage administrator then confirms those 
changes. When the conformance engine runs, it will 
identify the secondary volumes that must use the new 
schedule and modify the schedule. 


Simplifying Data Management


In this section we evaluate whether the data management
framework actually simplifies data management.
We consider three metrics. The first is how
many management entities must be administered. The
second is how many tasks need to be performed. The
third is whether we enable tasks that were impractical 
before we implemented this new framework. 


The data management framework significantly 
reduces the number of management entities for the 
data and storage administrator. Rather than having to 
administer each individual element of a data set (for 
example, all of the home directories), the data admini- 
strator manages a single data set. For the storage ad- 
ministrator, data sets eliminate the need to monitor in- 
dividual storage containers and relationships to deter- 
mine whether or not the infrastructure is broken. In 
addition, the use of resource pools eliminates the need 
to manually manage space. 


The data management framework clearly reduces 
the number of steps to perform any task. An important 
consequence of the automation that the conformance 
engine performs is the elimination of operator errors. 
This significantly simplifies administration. 


The data management framework does enable 
conformance monitoring, a task that was very difficult 
if not impractical in large environments. Consider 
what it would mean to monitor conformance without 
the framework. Conformance has three elements. The 
first is a description of the desired behavior. The sec- 
ond is a monitoring of the actual behavior. The third is 
a comparison of the two. Because the desired behavior 
is encoded in a way that allows for comparison by 
software, we are able to perform this task trivially. 
Without that encoding, a human would have to manu- 
ally compare the desired behavior to the actual behav- 
ior. Furthermore, because a human would be involved, 
the number of attributes that can be compared would 
be limited. Protection Manager’s conformance engine 
can check any number of attributes of a data protec- 
tion policy and can do it more quickly and frequently 
than a human could. 
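The three elements of conformance (desired behavior, observed behavior, and a comparison of the two) can be illustrated with a minimal sketch. The attribute names and the conformance_report function below are hypothetical and chosen only for illustration; they do not reflect how Protection Manager encodes its policies.

```python
def conformance_report(desired: dict, actual: dict) -> dict:
    """Map each non-conforming attribute to its (desired, actual) values."""
    return {
        key: (want, actual.get(key))
        for key, want in desired.items()
        if actual.get(key) != want
    }


# Desired behavior, as it might be encoded in a policy (made-up attributes).
desired = {"snapshot_schedule": "hourly", "backup_schedule": "daily", "retention_days": 30}
# Observed behavior, as it might be collected from the storage systems.
actual = {"snapshot_schedule": "hourly", "backup_schedule": "weekly", "retention_days": 30}

print(conformance_report(desired, actual))  # {'backup_schedule': ('daily', 'weekly')}
```

Because the comparison is a plain loop over encoded attributes, adding further attributes to the policy does not add manual work, which is the point the paragraph above makes.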


Conclusions 


In this section, we have shown that Protection 
Manager allows data administrators to configure all 
aspects of the storage system for data protection. We 
have shown that Protection Manager can reconfigure 
the storage if the storage is out of conformance. Final- 
ly, we have shown how Protection Manager, using our 
framework, simplifies data management by reducing 
the number of managed entities, reducing the number 
of steps that need to be performed and enabling tasks 
that were too difficult to be performed manually. 


Summary 


In the traditional approach to storage and data 
management, policy is separated from the actual soft- 
ware that implements the policy. As a result, policy 
has to be carefully translated into operations by stor- 
age administrators. Each step described could intro- 
duce a human error, requiring significant time and ef- 
fort to correct. By encoding the policy and being able 
to manage both the backup and storage process, using
the data set, policy and resource pool abstractions, the 
storage administrator is able to perform tasks in sub- 
stantially fewer steps. 


Future Work 


The task of modeling both cost and performance 
attributes in policies remains the subject of future 
work. At this point we have deferred to ad-hoc meth- 
ods, such as manual classification of systems based on 
user-defined properties, to guide the provisioning 
process. Future versions of software will do more au- 
tomatic classification of resources in this area. 


Data set hierarchies are also the subject of future 
work, especially considering complex application ar- 
chitectures such as SAP or Oracle deployments. These 
systems place different storage requirements on differ- 
ent subcomponents of each application, which results 
in different policies. A data set representing the entire 
application makes sense for performance monitoring 
or accounting purposes, but the sub-components might 
be composed of separate data sets, each with unique 
provisioning and protection policies. 


In addition, the migration of data from one stor- 
age container to another as a result of space or load 
balancing has not been explored. 


Finally, we intend to explore further refinements 
of how policies are expressed. The current protection 
policies are topologies of how data is to be made re- 
dundant. We want to explore how administrators can 
describe the amount of redundancy desired and let the 
system determine the protection topology. 


Conclusions 


By introducing the concept of data sets to Net- 
work Appliance management software, we have ab- 
stracted the management of physical storage contain- 
ers from the data. As a result, we can now use soft- 
ware to manage the data and adapt to changes in poli- 
cies or resources automatically, thus making it possi- 
ble for administrators to manage more data. 
ABSTRACT


We present an architecture designed for alert verification (i.e., to reduce false positives) in network
intrusion-detection systems. Our technique is based on a systematic (and automatic) anomaly-based
analysis of the system output, which provides useful context information regarding the network
services. The false positives raised by the NIDS analyzing the incoming traffic (which can be either
signature- or anomaly-based) are reduced by correlating them with the output anomalies. We designed
our architecture for TCP-based network services which have a client/server architecture (such as
HTTP). Benchmarks show a substantial reduction of false positives, between 50% and 100%.


Introduction 


Network intrusion-detection systems (NIDSs) are 
considered an effective second line of defense against 
network-based attacks directed to computer systems 
[4, 11], and — due to the increasing severity and likeli- 
hood of such attacks — are employed in almost all 
large-scale IT infrastructures [1]. 


The Achilles’ heel of NIDSs lies in the large
number of false positives (i.e., notifications of attacks 
that turn out to be false) that occur [26]: practitioners 
[24, 31] as well as researchers [3, 8, 15] observe that it 
is common for a NIDS to raise thousands of alerts per 
day, most of which are false alerts. Julisch [16] states 
that up to 99% of total alerts may not be related to real 
security issues. Notably, false positives affect both sig- 
nature and anomaly-based intrusion-detection systems 
[2]. A high rate of false alerts is — according to Axels- 
son [3] — the limiting factor for the performance of an 
intrusion-detection system. False alerts also cause an 
overload for IT personnel [24], who must verify every 
single alert, a task that is not only labor intensive but 
also error prone [9]. Indeed, a high false positive rate 
can even be exploited by attackers to overload IT per- 
sonnel, thereby lowering the defenses of the IT infra- 
structure. 


The main reason why NIDSs raise false positives 
is that — quoting Kruegel and Robertson [18] — they 
are often run without any (or very limited) information 
about the network resources that they protect (i.e., the 
context). Chaboya, et al. [6] state that the context 
knowledge (e.g., network and system configurations) 
can significantly improve alert verification. On the
other hand, building and updating a database of the 
configurations or running vulnerability assessment 
tools (e.g., Nessus [35]) to provide context knowledge 
is expensive and often not feasible when dealing with 
complex systems (indeed these activities require addi- 
tional labor of IT personnel, since the process of using 
them cannot be completely automated). Most current 
techniques to improve alert verification are tailored for 
specific attacks [14, 41] (e.g., worm-like) or support 
only signature-based NIDSs [33, 36] (e.g., Snort’s 
team has developed a specific plug-in, flowbits, to 
cope with this, but it has limited functionality). 


Our thesis is that, in many relevant situations, the 
context information can be obtained by a systematic 
(and automatic) anomaly-based analysis of the output 
traffic of the monitored network services; we believe 
this is possible when the output traffic presents some 
regularities. 


To demonstrate our claims, we have developed 
ATLANTIDES (Architecture for Alert verification in 
Network Intrusion Detection Systems), an innovative
architecture for easing the management of any NIDS 
(be it signature or anomaly-based) by reducing, in an 
automatic way, the number of false alarms that the 
NIDS raises. The main idea behind ATLANTIDES is 
simple: a successful attack often causes an anomaly in 
the output of the service [44], thus modifying the nor- 
mal output outcome. Detecting this anomaly can help 
in reducing false alerts. For instance, a successful SQL 
Injection attack [43] against a web application often 
causes the output of SQL table content (e.g., user/ad- 
min credentials) rather than the expected web content. 

ATLANTIDES is completely network-based: it relies
only on information gathered over the network,
without involving any host-based component. It works
by analyzing (using n-gram analysis [13]) and modeling
the normal output payload of the monitored network
services that is expected to be sent in response to a
client request. This normal output is specific to the
site; therefore the derived models reflect, in a way,
the network/system context. By correlating the
anomalies detected on the output with the alerts raised
by the NIDS monitoring the input traffic, we can
discard a number of the latter as being false alerts.
This way we obtain a system that raises considerably
fewer false positives than the original NIDS without
this correlation system.
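To make the idea of modeling output payloads concrete, the following sketch builds a byte n-gram frequency profile and compares two profiles. It is only an illustration under stated assumptions: the Manhattan distance, the single-sample "model", and the example payloads are choices made here and are not claimed to be the model used by ATLANTIDES.

```python
from collections import Counter


def ngram_profile(payload: bytes, n: int = 1) -> dict:
    """Relative frequency of each n-gram (byte sequence of length n)."""
    grams = [payload[i:i + n] for i in range(len(payload) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def distance(model: dict, sample: dict) -> float:
    """Manhattan distance between two frequency profiles (an assumed metric)."""
    keys = model.keys() | sample.keys()
    return sum(abs(model.get(k, 0.0) - sample.get(k, 0.0)) for k in keys)


# Made-up payloads: two ordinary pages and one response leaking a credential.
normal = ngram_profile(b"<html><body>Welcome back</body></html>")
similar = ngram_profile(b"<html><body>Welcome home</body></html>")
leak = ngram_profile(b"admin:5f4dcc3b5aa765d61d8327deb882cf99")

# The leaked-credential response lies farther from the model than a normal page.
print(distance(normal, similar) < distance(normal, leak))  # True
```

In practice a model would be trained on many responses per service; a single sample is used here only to keep the example short.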


Because it is based on output payload analysis, 
our architecture is designed for TCP-based client/serv- 
er network services (such as HTTP). Like all (exter- 
nal) payload-based analysis, ATLANTIDES cannot 
work properly with encrypted data unless the crypto- 
graphic keys are provided. 


In the past, simple correlations between input 
and output traffic have already been used to identify 
possible worm attacks [14, 41]. To the best of our 
knowledge, ATLANTIDES is the first proposed solu- 
tion for alert verification that: 

• works in combination with both signature-based
  and anomaly-based NIDSs

• operates in a completely automatic way after a
  quick setup, without any further human
  involvement (i.e., reducing the IT personnel
  overload), thus easing NIDS management


We benchmarked ATLANTIDES in combination 
with the signature-based NIDS Snort [34, 37], as well 
as in combination with the anomaly-based NIDS PO- 
SEIDON [5]. We carried out benchmarks both on a 
private data set as well as on the common DARPA 
1999 data set [22] (for the sake of completeness and to 
allow duplication of our results, despite criticism [23, 
25]). In seven out of eight cases, our benchmarks 
show a reduction of false positives between 50% and 
100%. 


Preliminaries 


In this section, we introduce the concepts used in 
the rest of the paper and explain how false positives 
arise in signature and anomaly-based systems. 


Signature-Based Systems 


Signature-based systems (SBSs), e.g., Snort [34, 
37], are based on pattern-matching techniques: the 
NIDS contains a known-attack signature database and 
tries to match these signatures with the analyzed data. 
When a match is found, an alert is raised. A specific 
signature must be developed off-line, and then loaded 
into the database before the system can begin to detect 
a particular intrusion. One of the disadvantages of
SBSs is that they can detect only known attacks: new
attacks will go unnoticed until the system is updated,
creating a window of opportunity for attackers (and
affecting NIDS completeness and accuracy [10, 11]).
Although this is considered acceptable for detecting
attacks on, e.g., the OS, it makes them less suitable for
protecting web-based services, because of their ad hoc
and dynamic nature.

False Positives in Signature-Based Systems

SBSs raise an alert every time that traffic matches
one of the signatures loaded into the system.
Consider for example the path traversal attack, which
allows access to files, directories, and commands
residing outside the (given) web document root directory.
The most elementary path traversal attack uses the
“../” character sequence to alter the resource location
requested in the URL. Variations include valid and
invalid Unicode-encoding (“..%u2216” or “..%c0%af”),
URL encoded characters (“%2e%2e%2f”), and double
URL encoding (“..%255c”) of the backslash character
(excerpted from the WASC Threat Classification [43]).


To detect these attacks many SBSs (using an out- 
of-the-box configuration) raise an alert each time they 
identify the pattern “../” in the incoming traffic. Un- 
fortunately, this pattern could be present in legal traf- 
fic too; some Content Management Systems (CMSs) 
insert relative paths in parameters to load files, which 
causes SBSs to raise a high number of false alerts. 
These false alerts can be avoided by deactivating the 
specific rule. On the other hand, this prevents the 
NIDS from detecting this sort of attack.
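The dilemma can be reproduced with a deliberately naive signature. The sketch below is an assumption-laden illustration, not a rule from Snort or any other NIDS: the function name and the request lines are invented, and the only library call used is Python's urllib.parse.unquote.

```python
from urllib.parse import unquote


def matches_traversal_signature(request_line: str) -> bool:
    """Naive signature: flag any request whose URL-decoded form contains '../'."""
    return "../" in unquote(request_line)


# Invented request lines for illustration.
attack = "GET /scripts/..%2f..%2fwinnt/system32/cmd.exe?/c+dir HTTP/1.0"
legit_cms = "GET /index.php?template=../themes/blue/header.tpl HTTP/1.1"
plain = "GET /index.html HTTP/1.1"

print(matches_traversal_signature(attack))     # True
print(matches_traversal_signature(legit_cms))  # True
print(matches_traversal_signature(plain))      # False
```

The legitimate CMS request is flagged exactly like the attack, which is the false alert described above. Note also that a single decoding pass leaves a double-encoded variant such as “..%255c” undetected, so this pattern alone is both too broad and too narrow.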

Tuning Signature-Based Systems 

The main reasons why alerts produced by SBSs 
turn out to be either false or irrelevant include the fol- 
lowing: 

• Writing signatures for NIDS is a thorny task
  [32], in which it is difficult to find the right
  balance between an overly specific signature
  (which is not able to detect a simple attack
  variation) and an overly general one (which will
  classify legitimate traffic as an attack attempt).

• The monitored environment is not susceptible
  to a certain vulnerability.

• Misconfigured network devices or services
  producing atypical output (usually, in this case, it
  is possible to observe recurrent and periodic
  phenomena).


A good deal of the false positives raised by an
SBS can be suppressed by a tuning activity: this activ- 
ity, based on deactivation of unneeded signatures, re- 
quires a thorough analysis of the environment by qual- 
ified IT personnel. Finally, to remain effective, SBSs 
require configuration updating to reflect changes in 
the environment: new vulnerabilities are discovered 
daily, new signatures are released regularly, and sys- 
tems may be patched, thereby (possibly adding or) re- 
moving vulnerabilities. 

Anomaly-Based Systems 

Anomaly-based systems (ABSs) use statistical 
methods to monitor network traffic. Intuitively, an 
ABS works by training itself to recognize acceptable 
behavior and then raising an alert for any behavior 
outside the boundaries of its training. In the training
phase, the ABS builds a model of the normal network 
traffic. Later, in the operational phase, the ABS flags 
as an attack any input that significantly deviates from 
the model. To determine when an input significantly 
deviates from the model, the ABS uses a distance function
and a threshold set by the user: when the distance between
the input and the model exceeds the threshold,
an alarm is raised. 


The ABSs’ main advantage is that they can de- 
tect zero-day attacks: novel attacks can be detected as 
soon as they take place. Clearly, because of their sta- 
tistical nature, ABSs are bound to raise a number of 
false positives, and the value of the threshold actually 
determines a compromise between the number of false 
positives and the number of false negatives the IT se- 
curity personnel is willing to accept. 

False Positives in Anomaly-Based Systems 

The high false positive rate is generally cited as 
one of the main disadvantages of anomaly-based sys- 
tems. The value of the threshold has a direct influence 
on both false negative and false positive rates [40]: a 
low threshold (too close to the model) yields a high 
number of alerts, and therefore a low false negative 
rate, but a high false positive rate. On the other hand, a 
high threshold yields a low number of alerts in general 
(therefore a high number of false negatives, but a low 
number of false positives). The most commonly used 
tuning procedure for ABSs is finding an optimal 
threshold value, i.e., the best compromise between a 
low number of false negatives and a low (or accept- 
able) number of false positives. This is typically car- 
ried out manually by trained IT personnel: different 
improving steps may be necessary to obtain a good 
balance between detection and false positive rates. 
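The trade-off governed by the threshold can be seen in a small worked example. The distances and labels below are made up purely for illustration, and confusion_at is a hypothetical helper, not part of any ABS.

```python
def confusion_at(threshold, scored):
    """Count false positives and false negatives for one threshold.

    `scored` holds (distance, is_attack) pairs; an alert is raised when the
    distance exceeds the threshold.
    """
    fp = sum(1 for d, attack in scored if d > threshold and not attack)
    fn = sum(1 for d, attack in scored if d <= threshold and attack)
    return fp, fn


# Made-up distances from the model, with ground-truth labels.
scored = [(0.10, False), (0.25, False), (0.40, False),
          (0.35, True), (0.55, True), (0.90, True)]

for t in (0.2, 0.5, 0.8):
    print(t, confusion_at(t, scored))
# 0.2 (2, 0)
# 0.5 (0, 1)
# 0.8 (0, 2)
```

With the low threshold no attack is missed but two benign inputs raise alerts; raising the threshold removes the false positives at the cost of missed attacks, which is why tuning usually takes several iterations.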


Architecture 


ATLANTIDES’s architecture (see Figure 1) con- 
sists of one external and two internal components. The 
external component is the NIDS monitoring the in- 
coming traffic. We do not make any assumption about 
it except that it is capable of raising an alert: AT- 
LANTIDES can work together with any kind of NIDS 
(signature or anomaly-based). 


The first internal component is the output anom- 
aly detector (OAD), which is actually an anomaly- 
based NIDS monitoring the outgoing traffic: the OAD 
refers to a statistical model describing the normal out- 
put of the system, and flags any behaviour that signifi- 
cantly deviates from the norm as the result of a possi- 
ble attack. 


The second internal component is the correlation 
engine (CE), which tracks (using stateful-inspection 
[7]) and correlates alerts related to incoming traffic 
and raised by the input NIDS with the output produced 
by the OAD. 


ATLANTIDES works as follows (see Figure 1). 
The input NIDS monitors the incoming traffic while, 
simultaneously, the OAD (after a training phase)
analyses the output of network services. When the in- 
put NIDS raises an alert, this is forwarded to the CE, 
together with the information regarding the communi- 
cation endpoints (i.e., source and destination IP ad- 
dresses, source and destination TCP ports as well as 
sequence numbers and communication status) of the 
packet that raised the alert. The CE uses a hash-table 
to store this information, using less than 20 bytes per
entry. Thus, the CE does not require much memory,
and ATLANTIDES can handle even a rate of 1000
alerts per second with a total memory space of 1 MB
(assuming the connections are kept in memory for,
e.g., a maximum of 60 seconds before being dropped).
At this time, the alert is not yet considered an
incident (it is a pre-alert) and
is not forwarded immediately to IT specialists.
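The memory figure can be checked with back-of-the-envelope arithmetic. The sketch below uses only the numbers quoted in the text and treats 20 bytes as an upper bound per entry; it ignores any overhead of the hash-table itself, which is an assumption.

```python
ALERTS_PER_SECOND = 1000  # alert rate quoted in the text
RETENTION_SECONDS = 60    # example retention time quoted in the text
BYTES_PER_ENTRY = 20      # upper bound per entry quoted in the text

# Worst case: every alert stays in the table for the full retention time.
entries = ALERTS_PER_SECOND * RETENTION_SECONDS
upper_bound_bytes = entries * BYTES_PER_ENTRY

print(entries)                  # 60000
print(upper_bound_bytes / 1e6)  # 1.2
```

Since each entry takes less than 20 bytes, the resulting bound of 1.2 MB is of the same order as the 1 MB stated above.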


Figure 1: ATLANTIDES’s architecture.


Next, the CE marks communication relative to
the given endpoints as suspicious and waits for the 
output of the OAD: if the OAD detects an anomaly in 
the outgoing traffic related to the tracked communica- 
tion, then the system considers the alert as an incident 
(i.e., a positive) and the alert is forwarded to the IT 
specialists for further handling and countermeasure re- 
actions, otherwise it is considered a false positive and 
discarded. The IT personnel can manually set (or
adjust) the time value t that the CE waits before
dropping an entry from its hash-table because no output
has been produced: during our experiments we fixed
this value to 60 seconds. This time could be critical if
an attack results in a large data transfer (but in this
case the OAD should detect the anomaly in the
transferred data) or in the case where an attacker is able
to delay the server response (although this seems quite
difficult to realize and the literature does not provide
any example of such an attack).


Although a delay is introduced to allow the OAD 
to process the data sent back to the client, this does not 
affect the detection itself: in fact, the delay, in the
worst case of no output sent at all, is equal to the time
value t. In Appendix A we provide the pseudo-code of
our architecture.
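The behavior of the correlation engine described in this section can be summarized in a short sketch. This is not the pseudo-code of Appendix A: the class, method names, and injectable clock are assumptions made for the example. The handling of expired pre-alerts follows the rule given in the Missing Output Response subsection, where a pre-alert with no observed response within t is treated as an incident.

```python
import time


class CorrelationEngine:
    """Holds pre-alerts until the output anomaly detector (OAD) gives a verdict."""

    def __init__(self, timeout=60.0, clock=time.monotonic):
        self.timeout = timeout  # the time value t
        self.clock = clock      # injectable so the sketch can be tested
        self.pending = {}       # endpoint tuple -> (alert, arrival time)
        self.incidents = []     # alerts forwarded to IT specialists

    def on_alert(self, endpoints, alert):
        """The input NIDS raised an alert: store it as a pre-alert."""
        self.pending[endpoints] = (alert, self.clock())

    def on_output(self, endpoints, anomalous):
        """The OAD analysed the response sent on this connection."""
        entry = self.pending.pop(endpoints, None)
        if entry is not None and anomalous:
            self.incidents.append(entry[0])  # confirmed: forward the alert
        # Otherwise the pre-alert is discarded as a false positive.

    def expire(self):
        """Forward pre-alerts whose response never arrived within the timeout."""
        now = self.clock()
        overdue = [ep for ep, (_, t0) in self.pending.items()
                   if now - t0 >= self.timeout]
        for ep in overdue:
            self.incidents.append(self.pending.pop(ep)[0])


now = [0.0]  # fake clock for the demonstration
ce = CorrelationEngine(timeout=60.0, clock=lambda: now[0])
conn = ("10.0.0.1", 1234, "10.0.0.2", 80)  # made-up endpoints

ce.on_alert(conn, "path traversal")
ce.on_output(conn, anomalous=False)
print(ce.incidents)  # []
```

In the demonstration the alert is suppressed because the response on the same connection looked normal; had the OAD reported an anomaly, the alert would have been appended to incidents.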


It should be clear from the architecture that
ATLANTIDES will never raise more false positives than
its input NIDS. In principle, the OAD could introduce
false positives or false negatives. The former
situation cannot take place because the output of the OAD
is evaluated only when an alert has already been raised
by the input NIDS: the OAD could mistake the
alert-related outgoing traffic for anomalous and then forward
the alert as a true positive, but the alert would have
been raised in any case if only the output of the
input NIDS were considered. Thus, the worst case is that
a false positive is not suppressed, but no new false
alert can be generated.


On the other hand, we have to discuss the possibility that
ATLANTIDES will introduce additional false negatives
(w.r.t. the input NIDS). This happens every time the OAD
classifies an alert corresponding to a true attack as a
false alert. False negatives are a common problem for alert
verification systems (and for ABSs in general). Because our
solution bases its verification on an anomaly-based engine,
the threshold used to discern outgoing traffic can be
adjusted manually by IT specialists to avoid false
negatives (previously proposed solutions cannot be tuned in
the same way, e.g., [18]). an effective threshold automatically.

Missing Output Response

What we just described is the most common behavior;
nevertheless, we have to take into account that there exist
attacks which, e.g., aim to disrupt the service completely
or which, by exploiting a buffer overflow, radically modify
the normal execution. In this case, if the OAD does not
detect any output related to the pre-alert raised by the
NIDS during the time window t, then the pre-alert is
considered an incident and is forwarded to an IT security
specialist. Although this could be considered crude,
because a missing response could have causes other than a
successful attack (e.g., an internal error), this strategy
does not introduce any additional false negatives or false
positives, since with a single NIDS (monitoring the
incoming traffic) the alert would be forwarded anyway.
Furthermore, Chaboya, et al. [6] experimentally verified
that most of the buffer overflow attacks against an HTTP
server do not produce any output in response to the
attacking requests. Although it is theoretically possible
for the attacker to craft a payload that sends a normal
response on the current connection after the exploitation,
several difficult technical problems limit the success of
this kind of attack. The attacker must inject an attack
payload that also contains the routines to generate the
normal output (or to jump to the original code where this
is done): since exploitable buffers are normally small, it
could be difficult to include the necessary payload.
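
The forwarding rule described so far can be summarized in a
few lines of code. The following Python sketch is only an
illustration: the function name forward_alert and its three
boolean parameters are hypothetical and do not appear in
the paper. It enumerates every combination of inputs and
checks that an alert is forwarded only if the input NIDS
raised it in the first place, which is why no new false
positive can be introduced.

```python
from itertools import product


def forward_alert(nids_alert: bool, output_seen: bool, oad_anomalous: bool) -> bool:
    """Return True when the alert reaches the IT specialist.

    nids_alert    -- the input NIDS raised a pre-alert for the connection
    output_seen   -- the server sent a response within the time window t
    oad_anomalous -- the OAD scored that response above its threshold
    (All three names are hypothetical, chosen for this illustration.)
    """
    if not nids_alert:
        # The OAD output is only consulted for connections already flagged
        # by the input NIDS, so nothing is forwarded here.
        return False
    if not output_seen:
        # Missing output response: the pre-alert is treated as an incident.
        return True
    # A response was seen: forward only if the OAD finds it anomalous.
    return oad_anomalous


# Enumerate every combination and check that only alerts already raised
# by the input NIDS are ever forwarded.
cases = list(product([False, True], repeat=3))
forwarded = [c for c in cases if forward_alert(*c)]
assert all(nids_alert for nids_alert, _, _ in forwarded)

for nids_alert, output_seen, oad_anomalous in cases:
    print(nids_alert, output_seen, oad_anomalous,
          "->", forward_alert(nids_alert, output_seen, oad_anomalous))
```

The only case in which a flagged connection is suppressed
is the one where a response was observed and the OAD
considered it normal.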

Since attacks against connection-less protocols are
nowadays less common (see the Common Vulnerabilities and
Exposures (CVE) database [39] for detailed statistics), we
have designed ATLANTIDES with the explicit goal of reducing
false positives when monitoring network services based on
the TCP protocol (e.g., HTTP, SMTP and FTP), where a
response is typically sent by the server to the client.


Although we do not aim to handle all kinds of possible
attacks (e.g., worms or DDoS attacks perpetrated by
generating a large number of legitimate connections), we
believe our solution can improve the accuracy of a NIDS
without any additional component installed directly on the
monitored hosts (such a component could, under certain
circumstances, e.g., a high number of connections, affect
host performance).


The OAD 


The OAD is basically a payload-based anomaly NIDS that
monitors the output of a network service rather than its
input. In our implementation we chose the NIDS POSEIDON as
the OAD, because we are familiar with it and it gives
better results than its leading competitor [5]. POSEIDON is
a 2-tier payload-based ABS that combines a neural network
with n-gram analysis to detect anomalies. POSEIDON performs
a packet-based analysis: every packet is classified by the
neural network; then, using the classification information
given by the neural network, the real detection phase takes
place based on statistical functions considering the
byte-frequency distributions (n-gram analysis).
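
To make the idea of byte-frequency (1-gram) analysis
concrete, the following Python sketch builds a model from a
few normal payloads and measures how far a new payload lies
from it. This is not POSEIDON's implementation: the neural
network stage is omitted, the Manhattan distance is an
assumption chosen for simplicity, and all function names
and sample payloads are hypothetical.

```python
from collections import Counter
from math import fsum


def byte_frequencies(payload: bytes) -> list[float]:
    """Relative frequency of each byte value (1-gram distribution)."""
    if not payload:
        return [0.0] * 256
    counts = Counter(payload)
    return [counts.get(b, 0) / len(payload) for b in range(256)]


def train(payloads: list[bytes]) -> list[float]:
    """Average the byte-frequency distributions of the training payloads."""
    dists = [byte_frequencies(p) for p in payloads]
    return [fsum(d[b] for d in dists) / len(dists) for b in range(256)]


def distance(model: list[float], payload: bytes) -> float:
    """Manhattan distance between a payload's distribution and the model
    (an assumed distance function, used here only for illustration)."""
    observed = byte_frequencies(payload)
    return fsum(abs(m - o) for m, o in zip(model, observed))


# Hypothetical "normal" server responses used as training data.
normal = [
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>ok</html>",
    b"HTTP/1.1 404 Not Found\r\nContent-Type: text/html\r\n\r\n<html>no</html>",
]
model = train(normal)

typical = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
unusual = bytes(range(128, 192))  # raw high-valued bytes, absent from training

print("typical:", distance(model, typical))
print("unusual:", distance(model, unusual))
```

A payload whose distance exceeds the configured threshold
would be reported as anomalous; the choice of that
threshold is discussed in Section Setting the Threshold.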

The fact that the OAD is anomaly-based (rather than
signature-based) has various advantages. The OAD can adapt
to the specific network environment/service, and it does
not require the definition of new signatures to detect
anomalous output, working in an unsupervised way (after
initial setup). Creating and maintaining a set of
signatures for outgoing traffic is a thorny and
labor-intensive task, as these signatures heavily depend on
local applications and must be updated each time a
modification of the application changes its output content.
The OAD, on the other hand, can simply include these
modifications in its model, without starting training over.
The disadvantage of being anomaly-based is that our OAD
needs an extensive (though unsupervised) training phase: a
significant amount of (normal) traffic data is needed to
build an accurate model of the service we monitor.

Setting the Threshold 

As we mentioned in Section Anomaly-Based Systems, in ABSs
completeness and accuracy are intrinsically related and
heavily influenced by the threshold value. Here, we call
completeness the ratio TP/(TP + FN) and accuracy the ratio
TP/(TP + FP), where TP is the number of true positives, FN
is the number of false negatives and FP is the number of
false positives raised during the benchmarks. Our
experiments show
reasonably good results, where t_max is the maximum
distance between the analyzed data and the model observed
during the training phase. Thus, we can automatically set
this parameter and IT personnel can later adjust it as
necessary.
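
The two ratios defined above, and their dependence on the
threshold, can be illustrated with a short Python sketch.
The anomaly scores and threshold values below are invented
for the example and are not taken from our benchmarks; the
function names are hypothetical as well.

```python
def completeness(tp: int, fn: int) -> float:
    """TP / (TP + FN): fraction of real attacks that are reported."""
    return tp / (tp + fn)


def accuracy(tp: int, fp: int) -> float:
    """TP / (TP + FP): fraction of reported alerts that are real attacks."""
    return tp / (tp + fp)


# Hypothetical anomaly scores (higher means more anomalous).
attack_scores = [0.9, 0.8, 0.6, 0.4]
benign_scores = [0.1, 0.2, 0.3, 0.5, 0.7]


def evaluate(threshold: float) -> tuple[float, float]:
    """Count TP, FN and FP for one threshold and return both ratios."""
    tp = sum(s > threshold for s in attack_scores)
    fn = len(attack_scores) - tp
    fp = sum(s > threshold for s in benign_scores)
    return completeness(tp, fn), accuracy(tp, fp)


for threshold in (0.25, 0.45, 0.65):
    c, a = evaluate(threshold)
    print(f"threshold={threshold:.2f} completeness={c:.3f} accuracy={a:.3f}")
```

With these made-up scores, lowering the threshold raises
completeness at the expense of accuracy, which is the
trade-off the IT personnel tune when adjusting the
parameter.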


     | POSEIDON      | POSEIDON+ATLANTIDES
DR   | 100%          | 100%
FP   | 1683 (2.83%)  | 774 (1.30%)

Table 1: Comparison between POSEIDON stand-alone and
POSEIDON in combination with ATLANTIDES using data set A;
DR stands for detection rate (percentage of attack
instances), while FP is the false positive rate (number of
packets and corresponding percentage); ATLANTIDES reduces
false positives by more than 50% without affecting the
detection rate (i.e., without introducing false negatives).

Figure 2: Detection rates for POSEIDON in combination with
ATLANTIDES using data set A (HTTP protocol): the x-axis and
y-axis present the false positive rate (packets) and the
detection rate (attack instances), respectively.
ATLANTIDES presents a lower false positive rate than
POSEIDON at the same detection rate. The plot also shows
how different ATLANTIDES threshold settings affect
detection and false positive rates.


Experiments and Results 


To validate our architecture, we benchmark ATLANTIDES in
combination with the signature-based NIDS Snort [34, 37] as
well as in combination with the anomaly-based NIDS
POSEIDON [5]. To carry out the experiments, we employ two
different data sets. First, we benchmark the system using a
private data set. Secondly, we use the DARPA 1999 data set
[22]: despite criticism [23, 25], this is a standard data
set for benchmarking NIDSs (see, e.g., [33, 42]) and it has
the advantage of allowing one to compare experiments. No
other data set containing sufficient data to perform
verifiable benchmarks is publicly available.


We consider an attack to be successfully detected 
when at least one packet carrying the attack payload is 
correctly flagged as malicious; all the other non-de- 
tected packets carrying the attack payload are not con- 
sidered to be false negatives. On the other hand, each 
packet incorrectly flagged as malicious is considered 
to be a false positive. Thus, the detection rate is
attack-based, while the false positive rate is
packet-based.
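
The difference between the two counting rules can be made
explicit with a small Python sketch. The Packet record, its
fields and the sample trace are hypothetical; in
particular, the sketch computes the false positive rate
relative to the benign packets, which is an assumption,
since the denominator is not spelled out above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    attack_id: Optional[str]  # None for benign packets (hypothetical label)
    flagged: bool             # True if the NIDS marked the packet as malicious


def detection_rate(packets: List[Packet]) -> float:
    """Attack-based: an attack counts as detected if any of its packets is flagged."""
    attacks = {p.attack_id for p in packets if p.attack_id is not None}
    detected = {p.attack_id for p in packets
                if p.attack_id is not None and p.flagged}
    return len(detected) / len(attacks)


def false_positive_rate(packets: List[Packet]) -> float:
    """Packet-based: every benign packet flagged as malicious counts once."""
    benign = [p for p in packets if p.attack_id is None]
    return sum(p.flagged for p in benign) / len(benign)


trace = (
    [Packet("xss-1", False), Packet("xss-1", True), Packet("xss-1", False)]
    + [Packet("sqli-1", False), Packet("sqli-1", False)]
    + [Packet(None, True)] * 2
    + [Packet(None, False)] * 8
)

print("detection rate:", detection_rate(trace))
print("false positive rate:", false_positive_rate(trace))
```

In this made-up trace one of the two attacks is detected
although only one of its three packets is flagged, and the
two unflagged packets of that attack are not counted as
false negatives.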


Tests With a Private Data Set 


To carry out our validation, and to see how the system
behaves when trained with a data set that was not made
attack-free,³ we consider a private data set we collected
at the University of Twente: this is data set A. Data were
collected on a public network for five consecutive working
days (24 hours per day), logging only TCP traffic directed
to (and originating from) a heavily loaded web server
(about 10 Gigabytes of total traffic per day). This web
server hosts the department’s official web sites as well as
the personal web pages of students and research staff:
thus, the traffic contains different types of data such as
static and dynamically generated HTML pages and, especially
in the outgoing traffic, common-format documents (e.g.,
PDF) as well as raw binary data (e.g., software
executables). We did not inject any artificial attack.


We focus on HTTP traffic because nowadays Internet attacks
are mainly directed at web servers and web-based
applications [17]: Kruegel, et al. [19] state that
web-based attacks accounted for 20%-30% of CVE entries [39]
from 1999 to 2004; Symantec Corporation [38] reports that,
in the first half of 2006, 69% of all discovered
vulnerabilities were related to web services and that,
during the same period, more than 60% of easily exploitable
vulnerabilities (those for which exploit code is either not
needed or well known) affected web applications. Symantec
states that typical examples of easily exploitable
vulnerabilities are SQL Injection and Cross-Site Scripting
(XSS) attacks.


To train the anomaly-detection engines of both 
POSEIDON and the OAD on data set A, we used a 
snapshot of the data collected during working hours 
(approximately three hours, 1.8 Gigabytes of data, 
randomly chosen). The chosen training data set had 
not been pre-processed and made attack-free: thus it is 
possible that the model includes some malicious activ- 
ity (that could negatively affect accuracy). For the 
same reason, we randomly chose another snapshot 
(approximately 1.8 Gigabytes of data) to benchmark 
POSEIDON stand-alone against POSEIDON in com- 
bination with ATLANTIDES. 


³ This is useful to see how the system performs in the
suboptimal situation in which the IT security specialist
does not have the time to clean up the training data set, a
situation that is likely to occur often in practice.

Figure 3: Detection rates for POSEIDON in combination with
ATLANTIDES using the DARPA 1999 data set (SMTP protocol):
the x-axis and y-axis present the false positive rate
(packets) and the detection rate (attack instances),
respectively. ATLANTIDES presents a lower false positive
rate than POSEIDON at the same detection rate. The plot
also shows how different ATLANTIDES threshold settings
affect detection and false positive rates.


ABSs can, obviously, achieve a 100% detection rate using a
very low threshold value, but this negatively affects the
false positive rate (as we mentioned in Section
Anomaly-Based Systems): we set the threshold of POSEIDON
experimentally to achieve the best detection rate at the
lowest possible false positive rate.


The alerts have been classified by the authors: we found
evidence of XSS and SQL Injection attacks [43] (which is
not surprising, according to Symantec’s report), plus some
probes checking for well-known paths (33 attack detections
in total). Table 1 summarizes the results we obtained. We
cannot evaluate ATLANTIDES in combination with Snort on
data set A because Snort does not find any true attack
against the system (Snort raised only false alerts): this
is not surprising, since Snort has only a few signatures
devoted to SQL Injection and XSS attacks. By setting a high
threshold value in ATLANTIDES we could remove all the false
positives, but this would give no indication of the
completeness and accuracy of ATLANTIDES. Figure 2 shows
detailed results of ATLANTIDES on data set A. Here, left is
better than right and above is better than below. A point
in the upper-left region indicates a configuration in which
(almost) every attack has been correctly forwarded, with
very few false positives left. On the other hand, a point
in the lower-right region indicates a configuration in
which some real attacks have been incorrectly suppressed
and a good deal of licit traffic was marked anomalous.


Tests With the DARPA 1999 Data Set 


The testing environment of the DARPA 1999 data set contains
several internal hosts that are attacked by both external
and internal attackers: in our tests, we consider only
inbound and outbound TCP packets that belong to attack
connections against hosts inside the network
172.16.0.0/16. We focus on the FTP, Telnet, SMTP and HTTP
protocols, because only these protocols, among the ones
contained in this data set, provide us with a sufficient
number of samples to train the OAD and, at the same time,
allow us to compare our architecture with POSEIDON
stand-alone, which has been benchmarked following the same
procedures.


We train the OAD of ATLANTIDES with the da- 
ta of weeks 1 and 3 (attack-free): for each different 
protocol we use a different OAD instance. Afterwards, 
we test ATLANTIDES together with both POSEIDON 
and Snort using week 4 and week 5 traffic. In order to 
distinguish between true and false positives, we refer 
to the attack instance table provided by the DARPA 
data set authors. Table 2 reports a comparison of the 
detection and false positive rates of Snort stand-alone 
(first column), Snort in combination with ATLAN- 
TIDES (second column), POSEIDON stand-alone (third 
column) and POSEIDON in combination with AT- 
LANTIDES (fourth column). 


3303 (11.31%) | 373 (1.35%) 
63776 (6.72%) | 56885 (5.99%) 
| areca | 2707 (159% | 
6476 (3.69%) | 2797 (1.59%) 





Table 2: Comparison between Snort stand-alone, Snort in
combination with ATLANTIDES, POSEIDON stand-alone and
POSEIDON in combination with ATLANTIDES using the DARPA
1999 data set: DR stands for detection rate (percentage of
attack instances), while FP is the false positive rate
(number of packets and corresponding percentage);
ATLANTIDES reduces false positives by more than 50% in most
cases, bringing them close to zero in 3 tests, without
affecting the detection rate (i.e., without introducing
false negatives).

In both cases, ATLANTIDES achieves a substantial
improvement over the stand-alone system without lowering
the detection rate (i.e., without introducing false
negatives); ATLANTIDES reduces the number of false
positives by at least 50% on every protocol benchmarked,
except for the Telnet protocol together with POSEIDON. In
our opinion, this discrepancy is due to the fact that
Telnet output is highly variable, since a user could issue
hundreds of different commands with different output; on
the other hand, protocols like HTTP, FTP and SMTP present
well-defined protocol schemas to exchange information
between client and server. ATLANTIDES is not applied to
SMTP traffic in combination with Snort because in this case
Snort raises no false positives.
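
As a worked example of how the reduction percentages are
obtained, the following Python sketch applies the formula
(before - after) / before to false positive counts that
appear in Table 1 and Table 2. The function name
fp_reduction is hypothetical, and the Table 2 pairs are
identified only by their values because the corresponding
row labels are not legible in this copy of the table.

```python
def fp_reduction(before: int, after: int) -> float:
    """Percentage of false positives removed by alert verification."""
    return 100.0 * (before - after) / before


# (stand-alone, with ATLANTIDES) false positive counts, in packets.
pairs = {
    "Table 1, POSEIDON on data set A": (1683, 774),
    "Table 2, pair 6476 -> 2797": (6476, 2797),
    "Table 2, pair 63776 -> 56885": (63776, 56885),
}

for label, (before, after) in pairs.items():
    print(f"{label}: {fp_reduction(before, after):.1f}% fewer false positives")
```

The first two pairs correspond to reductions above 50%,
while the third shows a much smaller reduction, of the kind
discussed above for Telnet in combination with POSEIDON.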


Related Work 


The problem of alert verification has been ad- 
dressed using two different kinds of approaches: we 
have techniques for identifying true positives, and 
techniques for identifying false positives. The main 
difference between our work and the papers described 
below is that we take into account the outgoing traffic 
of the system. 

Identifying True Positives 

Kruegel and Robertson [18] introduce a plug-in for Snort to
verify alerts: the plug-in integrates the Nessus
vulnerability scanner into Snort’s core. When an alert is
fired, it is not immediately forwarded but is first passed
to the verification engine. Since every Snort signature
comes with a unique identifier (assigned by CVE [39]), this
identifier is used to check for the presence of a
corresponding Nessus attack script. If found, the script is
executed against the target machine/network: the output is
extracted and used to flag the alert as either true or
false; an output cache is used to avoid further
verification for the same alert/target. Although this
approach is effective, there are several drawbacks: one has
to keep the Nessus attack script database up to date, and
this approach works only for signature-based NIDSs, while
ATLANTIDES can work with both types and in a completely
automatic way (i.e., no manual updates needed).


Ning, et al. developed a model [30] and an intrusion-alert
correlator [27] to help human analysts during the alert
verification phase. This work is based on the observation
that most attacks consist of several related stages, with
the early stages preparing for the later ones. Hyper-alert
correlation graphs are used to represent correlated alerts
in an intuitive way. However, this correlation technique is
ineffective when attackers use a different (yet not
spoofed) source IP address at each attack step. Ning and
Cui [27] demonstrate the effectiveness of this approach
when applied to a small data set (due to the exponential
complexity of hyper-alert graphs): in [28, 29] the same
authors present other utilities they developed to
facilitate the analysis of large sets of correlated alerts,
and report some benchmarks employing network traffic from
the DEFCON 8 Capture the Flag (CTF) event [12]. ATLANTIDES
does not present the same limitations on data set size.


Lee and Stolfo [20] develop a hybrid network- and
host-based framework based on data mining techniques, such
as sequential pattern mining and episode rules, to address
the problem of improving attack detection while maintaining
a low false positive rate. The system detects attacks by
combining different models and comparing them with actual
traffic features. Benchmarks have been conducted using the
DARPA 1998 data set [21]: the detection score for the
different attack types has a minimum value of 65%, with a
false positive rate always below 0.05%. Since the authors
use a different data set, we cannot compare the two
approaches directly: however, we note that our approach
does not use information collected from the operating
system hosting the monitored network service(s), thus
ATLANTIDES can work on-line without affecting the host
performance.

Identifying False Positives 

Pietraszek [33] tackles the problem of reducing 
false positives by introducing an alert classifier system 
(ALAC, Adaptive Learner for Alert Classification) 
based on machine learning techniques. During the 
training phase, the system classifies alerts into true 
and false positives, by attaching a label from a fixed 
set of user-defined labels to the current alert. Then, the 
system computes an extra parameter (called classifica- 
tion confidence) and presents this classification to a 
human analyst. The analyst’s feedback is used to gen- 
erate training examples, used by the learning algo- 
rithm to build and update its classifiers. After the 
training phase, the classifiers are used to classify new 
alerts. To ensure the stability of the system over time, 
a sub-sampling technique is applied: regularly, the 
system randomly selects n alerts to be forwarded to 
the analyst instead of processing them autonomously. 
This approach relies on the analyst’s ability to classify 
alerts properly and on his availability to operate in re- 
al-time (otherwise the system will not be updated in 
time); we believe that these (demanding) requirements 
can be considered acceptable for a signature-based 
NIDS (where the analyst can easily inspect both the 
signature and network packet(s) that triggered the 
alert), but it could be difficult to perform the same 
analysis with an anomaly-based NIDS. Benchmarks 
conducted over the 1999 DARPA data set, using Snort 
to generate alerts, show an overall false positive re- 
duction of over 30% (details on single attack protocols 
are not given). 


The main differences between ALAC and ATLANTIDES include:
(a) ALAC does not consider the outgoing traffic, and (b)
ALAC relies heavily on the expertise and the presence of an
analyst (in ATLANTIDES, all the IT specialist has to do is
to set the thresholds).

Julisch [15] presents a semi-automatic approach, based on
techniques which discover frequently occurring episodes in
a given sequence, for identifying false positives based on
the idea of root cause: an alert root cause is defined as
“the reason for which it occurs.” The author observes that
in most environments, it is possible to identify a small
number of highly predominant (and persistent) root causes:
removing such root causes drastically reduces the future
alert rate. Benchmarks conducted on a log trace from a
commercial signature-based NIDS deployed in a real network
show a reduction of 87% of false positives. No further
details are given about the testing conditions, network
topology or traffic typology. We cannot compare this
approach directly with ATLANTIDES because the data used by
the author is private; nevertheless, we note that this
approach is applicable only to signature-based NIDSs, while
ATLANTIDES is effective with anomaly-based systems too.


Analyzing Output Traffic

The idea of analyzing (and correlating) the output of a
(possibly) compromised system has been used before in the
context of worm detection.


Gu, et al. [14] scan the output traffic for specific port
numbers. When an anomaly has been detected in the incoming
traffic directed to a certain destination service port,
their system starts monitoring the output traffic to check
whether the host tries to contact other systems using the
same destination service port: if this is the case, then
the system is probably infected by a worm. Wang, et al.
[41] proceed in a similar way, comparing outgoing to
incoming traffic and looking for similarities: when an
anomaly has been detected in the incoming traffic, the
anomalous traffic is cached and compared to subsequent
outgoing traffic (to detect polymorphic worms). A
successful match indicates that the host has been infected
and that the worm is trying to replicate itself, infecting
other hosts. Any other kind of attack will not be handled
by the system. In contrast, our solution presents a general
architecture to carry out a complete anomaly detection on
the output to reduce the false positives of any NIDS placed
on the input channel. Indeed, we have shown that our
architecture works well in combination with both a
signature-based and an anomaly-based input NIDS.


Conclusion 


In this paper we present ATLANTIDES, an architecture for
automatic alert verification that systematically exploits
the detection of anomalies in the output traffic of a
system. ATLANTIDES can be used for reducing false positives
in both signature-based and anomaly-based NIDSs. The core
of ATLANTIDES consists of an output anomaly detector (OAD),
which compares output traffic with a model it has created
during the training phase. To reduce false positives on the
input NIDS (be it signature-based or anomaly-based)
monitoring the incoming traffic, ATLANTIDES checks whether
the communication raising an alert in the input NIDS
actually produces an anomaly in the outgoing traffic too.
In this case (and in one other exceptional situation, the
missing output response), the alert is forwarded to the IT
specialist; otherwise it is discarded. The fact that the
OAD is anomaly-based (rather than signature-based) allows
it to adapt to the specific network environment/service,
and to work in an unsupervised way (at least, after the
setup). Anomaly-based systems typically use a distance
function and a threshold to discern anomalous from licit
traffic. We introduce a simple heuristic to set the
ATLANTIDES threshold in an automatic, yet effective, way,
to further ease the management for IT security specialists
(who can adjust the threshold value if needed).


Benchmarks on a private data set and on the DARPA 1999 data
set show that ATLANTIDES achieves a reduction of false
positives between 50% and 100% in most cases, without
introducing any extra false negative, significantly easing
the management of NIDSs.


One possible extension to our architecture is adding
further information to make the detection of anomalies in
the output more precise: this information (e.g., the usual
number of bytes sent back from the server and the
communication duration) could be included in the model and
evaluated as well. Our architecture has been designed to
work with TCP-based network services: although it could be
easily adapted to work with UDP-based services, there exist
some issues related to this protocol. In fact, UDP is a
connection-less protocol, and this makes it harder to
distinguish real connections from the ones using spoofed IP
addresses. We will investigate this in future work.

Appendix A: ATLANTIDES Pseudo-code


In this section we give a semi-formal description of how ATLANTIDES works. 


DATA TYPE

    l = length of the longest packet payload

    PAYLOAD = array [1..l] of [0..255]   /* packet payload */

    HOMENET = set of IP addresses        /* hosts inside the monitored network */

    HOST = RECORD [
        address: IP address ∈ N
        port: TCP port ∈ N
    ]

    PACKET = RECORD [
        source: HOST
        destination: HOST
        payload: PAYLOAD
    ]

    alert = RECORD [
        alert:
            -∞ if input NIDS is SBS
            value ∈ Real if input NIDS is ABS
        processed: BOOLEAN     /* tracks a processed alert by the OAD */
        true_alert: BOOLEAN    /* alert is marked as an incident */
    ]

DATA STRUCTURE

    n ∈ N                      /* number of packets used for OAD training */
    oad ∈ NIDS                 /* ABS analyzing outgoing network traffic */
    out_threshold ∈ Real       /* OAD threshold */
    t ∈ N                      /* time value to wait for output */
    pre-alerts = set of alerts /* alerts received from the NIDS monitoring incoming traffic */

INIT PHASE    /* IT specialists set out_threshold and t values */

TRAINING PHASE

    INPUT:
        p: PACKET    /* outgoing network packet */

    for i := 1 to n    /* first, train the OAD with n samples */
        oad.train(p.source.address, p.source.port, p.payload)
                       /* POSEIDON builds a profile for each monitored service */
    end for

TESTING PHASE

    INPUT:
        p: PACKET    /* outgoing network packet */

    OUTPUT:
        true_alerts: set of alerts

    for each a ∈ pre-alerts do    /* checks if the packet belongs to a communication
                                     marked as anomalous by the input NIDS */
        if (match_alert(a, p) = TRUE) then
            anomaly_score := oad.test(p.source.address, p.source.port, p.payload)
                                  /* tests if the output is anomalous */
            if (anomaly_score > out_threshold) then
                a.true_alert := TRUE
                true_alerts.add(a)
            end if
            a.processed := TRUE
        end if
    end for

    for each a ∈ pre-alerts do    /* missing-output-response handling */
        if (a.processed = FALSE) and (current_time > t) then
            a.true_alert := TRUE
            a.processed := TRUE
            true_alerts.add(a)
        end if
    end for
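
For readers who prefer executable code, the testing phase
above can be approximated by the following Python sketch.
It is a simplified illustration under several assumptions,
not our implementation: the class and method names, the
raised_at timestamp (used to measure the wait time t per
pre-alert) and the toy scoring function are all
hypothetical, and a pre-alert is resolved by the first
matching response packet rather than by every packet of the
response.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Endpoint = Tuple[str, int]          # (IP address, TCP port)
Flow = Tuple[Endpoint, Endpoint]    # (client, server)


@dataclass
class PreAlert:
    flow: Flow
    raised_at: float                # seconds; hypothetical timestamp field
    processed: bool = False
    true_alert: bool = False


@dataclass
class CorrelationEngine:
    score: Callable[[Endpoint, bytes], float]   # stand-in for oad.test
    out_threshold: float
    t: float = 60.0                             # wait time, 60 s in the experiments
    pending: Dict[Flow, PreAlert] = field(default_factory=dict)

    def on_pre_alert(self, client: Endpoint, server: Endpoint, now: float) -> None:
        """Track a communication flagged by the input NIDS."""
        self.pending[(client, server)] = PreAlert((client, server), now)

    def on_outgoing(self, server: Endpoint, client: Endpoint,
                    payload: bytes) -> List[PreAlert]:
        """Test a server response; return the alert if the output is anomalous."""
        alert = self.pending.pop((client, server), None)
        if alert is None:
            return []                 # not related to any pre-alert: ignore
        alert.processed = True
        if self.score(server, payload) > self.out_threshold:
            alert.true_alert = True
            return [alert]
        return []                     # normal output: suppress as false positive

    def expire(self, now: float) -> List[PreAlert]:
        """Missing output response: forward pre-alerts older than t."""
        forwarded = []
        for flow, alert in list(self.pending.items()):
            if now - alert.raised_at > self.t:
                alert.processed = alert.true_alert = True
                forwarded.append(self.pending.pop(flow))
        return forwarded


def toy_score(server: Endpoint, payload: bytes) -> float:
    """Hypothetical scoring function standing in for a trained OAD."""
    return 1.0 if b"root:" in payload else 0.0


ce = CorrelationEngine(score=toy_score, out_threshold=0.5)
client, server = ("10.0.0.5", 40000), ("172.16.0.10", 80)

ce.on_pre_alert(client, server, now=0.0)
print(len(ce.on_outgoing(server, client, b"HTTP/1.1 200 OK")))    # suppressed

ce.on_pre_alert(client, server, now=10.0)
print(len(ce.on_outgoing(server, client, b"root:x:0:0:/bin/sh")))  # forwarded

ce.on_pre_alert(client, server, now=100.0)
print(len(ce.expire(now=161.0)))                                   # no output within t
```

Keying the pending pre-alerts by the pair of endpoints
mirrors the hash table kept by the CE, so that matching an
outgoing packet against the tracked communications does not
require scanning all pre-alerts.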

ABSTRACT


Problem determination remains one of the most expensive and time-consuming functions in
system management due to the difficulty in automating what is essentially a highly
experience-dependent task. In this paper we study the characteristics of problem tickets in an
enterprise IT infrastructure and observe that most of the tickets come from very few products and
modules, and that OS problems take longer to resolve. We propose PDA, a problem management
tool that provides automated problem diagnosis capabilities to assist system administrators in
solving real-world problems more efficiently. PDA uses a two-level approach of proactive,
high-level system health checks, coupled with rule-based “drill-down” probing to automatically
collect detailed information related to the problem. Our tool allows system administrators to
author and customize probes and rules and to share them across the organization. We illustrate
the usage and benefits of PDA with a number of UNIX problem scenarios that show PDA is able to
quickly collect key information through its rules to aid in problem determination.


Introduction 


Computer system administrators (SAs) play a 
number of important roles in managing enterprise IT 
infrastructure, either as members of internal IT depart- 
ments, or with IT service providers who remotely 
manage systems for customers. In addition to handling 
installation, monitoring, maintenance, upgrades, and 
other tasks, one of the most important jobs of SAs is 
to diagnose and solve problems. 


In IT services environments, the problem man- 
agement process (defined, for example, in ITIL [4]) 
describes the steps through which computer problems 
are reported, diagnosed, and solved. A typical se- 
quence is for a problem ticket to be opened by a call to 
the customer helpdesk, or by an alert generated by a 
monitoring system. This is followed by some basic di- 
agnosis by first level support personnel based on, for 
example, well-documented procedures. Simple issues 
such as password resets or file restoration can often be 
handled here without progressing further. 


If the problem needs further investigation, it is passed to
second or third level personnel, who are typically SAs with
more advanced skills and knowledge. They often start with
vague or incomplete descriptions of problems (e.g.,
“application is running very slowly,” “mail isn’t working,”
or “CPU threshold exceeded”) which require significant
investigation before the cause and solution are found. In
the context of server support, administrators often consult
monitoring tools that provide some specific system
indicators, and then log in to the server to collect
additional detailed information using system utilities. In
the course of day-to-day problem management, this process
is often the most time-consuming and expensive task for
SAs: it is difficult to automate, and requires field
experience and expert knowledge.


Unlike first level support personnel, SAs handling this
advanced stage of problem management have hardly any
well-documented procedures to refer to. SAs usually rely on
their own knowledge and experience to diagnose the root
cause of the problem. Because of the complexity of the
problems, it is a significant challenge to create useful
documentation of this process that others can refer to.
Creating and sharing this knowledge is especially
challenging in environments where support teams are
globally distributed.


In this paper we describe Problem Determination 
Advisor (PDA), a system management tool for servers 
that provides health indicators and automated problem 
diagnosis capabilities to assist SAs in solving problems. 
PDA is intended to be used by second or third level 
SAs who diagnose and address problems that cannot be 
handled by first level support (i.e., helpdesk). PDA uses 
a two-level approach that provides high-level health 
monitoring of key subsystems, and scoped probing that 
collects additional details in an on-demand, rule-based 
fashion. This approach has the advantage of performing 
detailed “drill-down” probing only when it is relevant 
to the problem at hand, hence avoiding the overhead of 
collecting such data all the time. Moreover, PDA’s 
probes and problem determination rules are not deter- 
mined arbitrarily; they are crafted based on an exten- 
sive empirical study of real problems that occur in prac- 
tice and expert rules drawn from system administrator 
best practices. The rules are also customizable to better 
serve a particular environment or purpose. Our specific
contributions include:

• Characterization of server support problems:
  using real problem tickets from a diverse group
  of servers in a large enterprise, we study the
  relative frequency of various types of problems,
  determine categories of commonly occurring
  problems based on their nature and frequency,
  and study the relative difficulty of resolving
  different types of problems.

• Problem determination rules for scoped probing:
  based on examination of actual problem descriptions
  and their solutions, tools and scripts used
  by SAs in practice, and knowledge capture from
  SAs, we develop a set of rules that determine
  how probes should be dispatched to automatically
  collect the information necessary to assist the
  diagnosis of various types of problems.

• Extensible PDA tool architecture: we codify
  problem determination rules and best practices
  in a tool that can be expanded as new rules are
  developed or as specific needs arise in different
  IT environments.


The measure by which most system management 
tools, and problem determination tools in particular, 
are judged is the reduction they enable in the time or 
effort needed to solve problems. As such, we illustrate 
the usage and benefits of PDA with a number of 
UNIX problem scenarios drawn from problem tickets, 
discussions with SAs, and system documentation. In 
these scenarios, SAs must know which information to 
collect and how to collect it (manually) in order to 
solve the problem. We show that PDA is able to 
quickly collect key information through its problem 
determination rules to aid in finding the cause, or in 
some cases pin-pointing the problem precisely. This 
allows expert SAs to solve problems more efficiently, 
and less experienced SAs to benefit from the diagnos- 
tic best practices codified in the tool. 


Our current library of problem determination rules 
and probes is geared toward system software problems, 
for example on commercial and open source UNIX 
platforms. However, the general approach of PDA is 
applicable to other problem areas, particularly applica- 
tions and middleware (which are a significant fraction 
of all problems). As we discuss in Section Problem 
Characterization, the bulk of reported problems are re- 
lated to a relatively small number of specific problem 
areas, and this holds across problems related to applica- 
tions, platform, networking, etc. Hence, developing 
rules based on best practices for the most commonly 
encountered problems is quite feasible. However, we
expect this model to be more beneficial in IT services
environments where similar platforms and configurations
co-exist. In heterogeneous environments such as
universities, the applicable best practices may vary.

In the next section, we present the results of a
characterization study of problem tickets which we use
to inform our choice of system probes and the design of
problem determination rules. Section PDA Design and
Implementation follows with a description of the PDA
tool design and implementation. In Section Problem
determination experiences with PDA we evaluate PDA’s
efficacy in the context of several specific problem
scenarios. Section Related work briefly discusses some of
the related work. We conclude the paper in Section
Summary with a discussion of our ongoing work in the
implementation and evaluation of PDA.


Problem Characterization 


We begin with a characterization of real-life 
problems from a large, distributed commercial IT en- 
vironment. The problems are drawn from an analysis 
of about 3.5 million problem tickets managed through 
a problem tracking system over a nine-month period. 
These tickets contain a rich amount of information,
including a problem description, an indication of the
affected system, and metadata such as timestamps, severity,
and identity of the SAs handling the tickets. In our
analysis, we derive a number of statistical characteris- 
tics that allow us to better understand the different cate- 
gories and characteristics of common problems. Specif- 
ically, we focus on: 

• which applications contribute to the majority of
  the problems, and in particular whether a few
  applications are responsible for most of the
  observed problems

• what are the most common causes of application
  problems

• what portion of problems arise due to application
  vs. operating system-level issues

• how much time is spent in resolving different
  types of problems


Our objective in this section is to develop in- 
sights from the analysis of problem tickets in a real 
enterprise IT environment, and later use these insights 
to develop tooling for improving the efficiency of 
problem diagnosis and resolution. 


Fields in Problem Tickets 


We examine a number of attributes of each ticket 
including both structured attributes with well-defined 
values, and unstructured attributes which are mostly 
free-text. Structured attributes contain information such 
as a ticket’s open and close time, incident occurrence 
date, SA user ID, and some enumerated problem char- 
acteristics such as problem type, product name, etc. 
Problem description and solution description are free- 
text. In this study, we are particularly interested in the 
following data fields: 

• product name: There are about 600 products appearing
  in the tickets. Most of them are application names,
  while operating system or platform-related problems
  are categorized by general terms such as AIX, HP-UX,
  Linux, Windows.

• product component: Each product is further broken
  down into various predefined components. They give
  finer-grained information about the problems. For
  example, components within the Windows product
  include bootup, explorer, system errors, password.

• product module: This is the finest-grain information
  available in the problem database: the module
  identifies the system sub-component that is having
  the problem. For example, in Windows, the bootup
  component is divided into different modules such as
  safe mode, unable to boot, inaccessible boot, and
  explorer contains navigation, move/copy files, search.

• ticket type: When a ticket is first opened, it is
  categorized into a type, for example: error,
  performance, information request, load request and
  others. When studying common problems, we focus on
  those tickets with type of error and performance
  because they usually require further diagnosis to
  identify the root cause.

• cause code: This field has 34 values to describe
  various problem causes. It is particularly useful
  when other fields such as component and module
  are not specified. For example, all of the tickets
  related to the Linux product name have software
  as the component field and OS for module.


Problem Characteristics Overview 


To understand the problem tickets in the system, 
we start from statistical characteristics of the tickets 
based on the above-mentioned fields. We make the 
following observations: 

• Most of the problem tickets arise from a few
  products.
  We examined all tickets according to the product
  name field. Among the 600 products, we observe
  that 50 products, which are less than 10% of the
  total products, account for 90% of all tickets.
  Figure 1 shows a cumulative frequency graph of the
  number of products and the percentage of the
  tickets with that product. This observation is
  similar to the failure characteristics observed on
  Windows XP machines [5]. Among the top 50 products
  with the most problem tickets, the number one
  product is an enterprise email system, followed by
  a virtual private network (VPN) application, and
  then a popular operating system. Figure 2 shows
  more detail about the number of tickets and their
  distribution, and Table 1 shows the top 10 products
  and the number of tickets from each.

• Within each product, most of the problems come
  from a few modules.
  Next, we analyze the top products which have the
  most tickets by categorizing them according to
  product components and modules. Table 3 lists the
  18 modules which comprise 70% of the tickets
  reported for the mail system, while the total number
  of modules defined for this product is 162. For the
  VPN application, we see an equally skewed
  distribution: 70% of the tickets are from six
  modules, out of a total of 70 modules. Table 2 lists
  these top six modules. Details of product components
  are not listed here because they reveal less
  information than the modules.

Figure 1: Cumulative frequency of the tickets contributed
by the set of products.

Figure 2: Number of tickets from the top 10% of products
with the most problem tickets.


These two observations together suggest that, by 
focusing on problem determination for a relatively 
small number of products and corresponding modules, 
we are able to cover a large portion of problem tickets. 
However, it is also important to note that each module 
may in fact exhibit many problem symptoms and pos- 
sibly many root causes. This implies that a practical 
tool should be highly customizable in order to address 
symptoms that might be unique to a particular envi- 
ronment. 
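
The coverage argument above amounts to ranking products by ticket count and accumulating their share of the total. The following is a minimal sketch of that calculation, not the analysis code used in the study; the function name products_needed and the sample counts are illustrative assumptions.

```python
def products_needed(ticket_counts, target_share):
    """Return how many of the busiest products are needed to cover
    at least target_share (between 0 and 1) of all tickets."""
    total = sum(ticket_counts.values())
    covered = 0
    ranked = sorted(ticket_counts.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        covered += count
        if covered / total >= target_share:
            return rank
    return len(ranked)


# Hypothetical counts, not taken from the ticket database.
sample = {"mail": 600, "vpn": 300, "os": 50, "backup": 30, "print": 20}
print(products_needed(sample, 0.90))
```

Applied to per-product ticket counts, the same loop yields the kind of cumulative curve summarized in Figure 1.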


Operating System Problems 


In terms of quantity of problem tickets, operating
system (OS) problems are not as prominent (with the
exception of Windows-related tickets, which constitute
about 11% of all tickets). However, system software
or OS tickets are often the most diverse and require
significant effort to diagnose. Figure 3 illustrates
the distribution of time spent in resolving problem
tickets for top applications and systems/OSes. Our
analysis indicates that nearly an order of magnitude
more time is spent in the resolution of problems
arising from UNIX system issues than on other types of
problems. VPN and mail applications, both comprising
a large number of application-related problem
tickets, need considerably less time to resolve
compared to UNIX system related problems.






                                             Cumulative
Product name                  # of tickets   % of tickets
Mail app                            618575            18%
VPN app                             346812            28%
Windows OS                          331978            38%
Mail app (prev version)             229415            45%
PC hardware                         171078            53%
Network connectivity                170937            58%
Software installer                   92358            61%
Mainframe local app                  59314            63%
Telephone                            56396            64%
Desktop security audit tool          46470            66%

Table 1: Top 10 products in the ticket database.




















                                   Cumulative
Module              # of tickets   % of tickets
RESET                     114477            38%
LOOPING                    23620            45%
INTRANET APP               22045            53%
INFO                       20537            60%
INTER/INTRA NET            14220            64%
CLIENT                     13409            69%
UNABLETOCONNECT            13390            73%

Table 2: Product modules and the number of tickets
from these modules for the VPN application.








We also examine the top problem types for OS
platforms. Unfortunately, there are no OS components
and modules defined except for Windows (perhaps
due to the complexity of OS problems). Instead, we
use the cause code field in analyzing OS tickets. We
combine similar cause code values to arrive at a set of
broader problem categories including: application,
configuration, hardware, request/query (e.g., password
reset or howto questions), duplicate (i.e., a problem
reported multiple times), storage, network, human
error, and unsupported (i.e., out of scope for the
support team). Note that “application” in the OS tickets
differs from an application as a standalone product:
these are mostly system processes or services shipped
with the OS, such as Sendmail, NFS, Samba, etc.


Common OS Problems 


From the problem categories in Figure 4, we further
examine ticket details in a few categories to
identify commonly occurring problems in each
category. In particular, we focus on problem categories
related to systems software and application-related
issues on the UNIX platform, including: application,
configuration, storage, and network. We use a
combination of ticket clustering based on structured
attributes, and manual inspection of problem and
solution description text, to extract a set of typical
problems. We describe a few of these sample problems
below.¹ Recall that the problems are grouped by cause
code values that are indications of the identified cause;
this may be a different category than the original
problem description would initially indicate.


                                 Cumulative
Module            # of tickets   % of tickets
OPENING DB               54745            12%
SETTINGS                 17453            45%
[illegible]              17013            49%
[illegible]              13785            52%
[illegible]              13049            55%
CHANGE                   12229            57%
SETUP                    10631            60%
SENDING                   8786            64%
GENINFO                   7034            69%
AUTHORIZATION             6452            70%

Table 3: Product modules and the number of tickets
from these modules for the mail application.

Figure 3: Time spent in solving tickets in top two
products and in three OSes.

¹ Specific hostnames, directory names, etc. have been
anonymized in these samples.

Figure 4: Categorization of problem tickets for Linux,
UNIX and Windows servers (y-axis: normalized percentage
of problem tickets). We examined 4,598, 42,998, and
331,978 tickets for each platform, respectively.


Configuration Problem Samples

• Multiple email messages from dissimilar domains
  appear to be blocked. The cause was a change in the
  spam filtering mechanism/update.
• SSH failing after OS version upgrade. The solution
  to this problem was to change permissions on
  /dev/random and /dev/urandom to 644.
• The following NFS filesystems did not mount
  after miwsrv1 rebooted today:
    efkxeastapps.abc.xyz.com:/u0
    efkxeastapps.abc:/u1
    efkxeastapps.abc.xyz.com:/u2
    efkxeastapps.abc.xyz.com:/u3
    efkxeastapps.abc.xyz.com:/u4
  The cause was that the filesystem had been updated
  to use another type. The solution was to mount the
  filesystem using that type name.


Application Problem Samples 


• The databases will not start because of some
  TCPIP problem.
  The cause was that the port mapper was down.
• 03:38 OPS received the following red alert on
  UNIX icon for abcdef03:
    Host:named.my.ibm.com
    Msg:The percentage of available swap
    space is low (20.12939453125 percent).
  The solution was to stop unwanted processes.
• SRM data is not getting retrieved from STI-
  BACKUP due to the refusal of the SFTP connection.
  Symptoms suggest SSH is not running.
  The problem was solved by changing /usr/bin
  link to ssh and /etc/inetd.conf path then
  restarting ssh.

Storage

• Base_OS_Monitors critical 98.006800 Percent
  space used (/var) This Critical Sentry2_0_disk
  usedpct event was received by a monitoring
  server at 3/6/2006 6:20 EST.
  Typical solutions to this space usage problem
  include removing old or large files, expanding
  file system space, and compressing old files.
• System backup failed for sx0000e0 on 18082006.
  More details could be found in the logfile
  /var/adm/mksysb.out.
  The cause was that some temporary files were
  not found.

A summary of examples of top problems that we
found among problem tickets is shown in the first
column of Table 4.


Implications From Problem Ticket Analysis 


Our observations in characterizing problems in a
large IT environment have a number of implications
that guide the design of PDA. Recall that the first key
finding was that a few products or applications are re- 
sponsible for the majority of problem tickets opened. 
The second observation was that the cause of these 
problems can be attributed to a relatively small set of 
functional components of these products. These results 
together imply that problem determination tooling that 
addresses a finite and fairly small set of important 
problem types can in fact cover a significant portion of 
problem tickets that are observed in practice. 


We also observed that, in terms of time spent on
resolution, UNIX system related problems are relatively
difficult to resolve. Therefore, this is an important
problem area to consider. Improving diagnosis
efficiency for UNIX-related problems can potentially
provide significant value in terms of reduced time
and effort.


These observations motivate our implementation 
of PDA, which addresses a number of key categories 
of system software or OS-related problems. While our 
intention is to broaden the applicability of PDA to oth- 
er problem types (e.g., applications), our initial focus 
on OS problems on UNIX platforms is justified based 
on the analysis presented above. 


PDA Design and Implementation 


In this section, we first describe the overall de- 
sign, architecture, and implementation of Problem De- 
termination Advisor. We also describe how knowledge 
gathered from problem tickets is incorporated in PDA’s 
automated problem diagnosis capabilities. Details of 
problem determination rules and system probes, as well 
as some realistic examples, are also illustrated below. 


Design Overview 


From the previous section, we have observed 
that a large percentage of problem tickets are related 
to a small number of products, and within each, most 
problems have only a few primary causes. We use a 
two-level approach that provides high-level health 
monitoring of key subsystems, and scoped probing
that collects additional system details. In Table 4, the
second column lists some of the high-level monitors 
that are used in practice to identify the occurrence of 
common problems, while the third column shows ex- 
amples of diagnostic probes that can help identify the 
cause of common problems in the corresponding cate- 
gory. The list of problems reported here represents a 
sample of the problem scenarios we have examined, 
and is by no means complete. We are continuing to ex- 
pand the list of common problems and corresponding 
problem determination probes as we examine addi- 
tional tickets and continue the knowledge capture of 
SA best practices. 


If the full complement of monitors and probes
were always active (e.g., executing periodically), they
would likely impose a noticeable overhead on production
systems. Hence, the two-level probing approach
uses (i) periodic, low-overhead monitoring to provide
a high-level health view of key subsystems, and (ii)
detailed diagnostic probes when a problem is detected.
This is similar to the way SAs tackle problems; the
difference is that PDA tries to collect the relevant
problem details automatically.
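
The two-level idea can be illustrated with a short sketch: inexpensive monitors run on every cycle, and the detailed probes for a subsystem run only when its monitor reports a problem. This is an illustration under stated assumptions, not PDA's implementation; run_cycle, the subsystem names, and the lambda stand-ins for real checks are all hypothetical.

```python
def run_cycle(monitors, diagnostics):
    """Run every low-overhead monitor once; run the detailed probes
    only for subsystems whose monitor reports a problem."""
    details = {}
    for subsystem, is_healthy in monitors.items():
        if is_healthy():
            continue  # healthy: skip the costly drill-down probes
        probes = diagnostics.get(subsystem, [])
        details[subsystem] = [probe() for probe in probes]
    return details


# Hypothetical monitors and probes standing in for real system checks.
monitors = {"network": lambda: False, "storage": lambda: True}
diagnostics = {
    "network": [lambda: "routing table dump"],
    "storage": [lambda: "largest files listing"],
}
print(run_cycle(monitors, diagnostics))
```

In this example only the network probes execute, because the storage monitor reports a healthy subsystem.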


The knowledge of which diagnostic probes should 
be run when a problem is detected is encoded in prob- 
lem determination rules. These rules are represented in 
a decision tree structure in which the traversed path 
through the tree dictates the series of diagnostic 
probes that are executed. At each node, the output of 
one or more diagnostic probes is evaluated against 
specified conditions to decide how to proceed in the 
traversal. In our implementation, diagnostic probes 
generally use available utilities on the platform direct- 
ly to retrieve the needed information. 
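
A decision-tree rule of this kind could be represented and traversed roughly as follows. This is a sketch only: the RuleNode class, the run_rule function, the probe names disk_usage and largest_files, and the 90% threshold are assumptions made for illustration and do not come from PDA.

```python
from dataclasses import dataclass, field


@dataclass
class RuleNode:
    probe: str
    # Each branch pairs a condition on the probe output (a dict of
    # key-value strings) with the next node; no match ends the traversal.
    branches: list = field(default_factory=list)


def run_rule(root, run_probe):
    """Traverse the tree and return (probe name, output) for each
    probe that was executed along the chosen path."""
    trace = []
    node = root
    while node is not None:
        output = run_probe(node.probe)
        trace.append((node.probe, output))
        next_node = None
        for condition, child in node.branches:
            if condition(output):
                next_node = child
                break
        node = next_node
    return trace


# Hypothetical rule: drill down only when the filesystem is nearly full.
largest_files = RuleNode(probe="largest_files")
root = RuleNode(
    probe="disk_usage",
    branches=[(lambda out: float(out["USED_PCT"]) > 90.0, largest_files)],
)


def fake_probe(name):
    # Canned outputs in place of real probe execution.
    outputs = {
        "disk_usage": {"USED_PCT": "98.0"},
        "largest_files": {"TOP": "/var/log/big.log"},
    }
    return outputs[name]


for probe, output in run_rule(root, fake_probe):
    print(probe, output)
```

Passing the probe runner in as a function keeps the traversal independent of how probes are actually dispatched to a managed server.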

Problem Determination Rules 

Figure 5 shows a sample rule tree which can diag- 
nose problems related to the network connectivity of a 
managed server. The corresponding health monitor con- 
siders the system’s network connection to be available 
if it is able to reach (e.g., ping, or retrieve a Web page)
several specified hosts outside of its subnet. These
could be other servers it depends on, or well-known 
servers on the Internet, for example. If a disconnection 
is detected by the health monitor, scoped probing using 
diagnostic probes will be invoked to gather information 
to help determine the root cause of the problem. Ac- 
cording to the rule tree, the first diagnostic probe 
should check the network stack by pinging the loop- 
back address. If no problem is found, the next node in 
the rule tree will dispatch another diagnostic probe to 
check that a default gateway is defined in the local 
routing table, and that it is reachable. If it is unreach- 
able, the problem might be with the local subnet or 
network interface card. Otherwise, potential DNS-re- 
lated problems are checked, for example verifying that 
/etc/resolv.conf exists, and that it contains DNS server 
entries (that are reachable). Clearly some diagnostic 
probes have dependencies (e.g., check /etc/resolv.conf 
exists before checking that a DNS server is reachable) 
and have to be executed in a certain order. In the ab- 
sence of dependencies, probes could be ordered differ- 
ently, perhaps tailored to the likelihood of certain 
types of failures in a given environment. 
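
One step of this rule, checking that /etc/resolv.conf exists and lists DNS servers before testing their reachability, might be sketched as below. The function names and the returned messages are illustrative assumptions; the paper does not give the probe's code, and the reachability test itself is left out.

```python
import os


def nameservers(resolv_conf_text):
    """Return the addresses found on 'nameserver' lines."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def check_dns_config(path="/etc/resolv.conf"):
    """Apply the dependent checks in order: file exists, then has entries."""
    if not os.path.exists(path):
        return "resolv.conf missing"
    with open(path) as handle:
        servers = nameservers(handle.read())
    if not servers:
        return "no DNS servers configured"
    return "DNS servers to test for reachability: " + ", ".join(servers)


print(check_dns_config())
```

The ordering mirrors the dependency noted above: the file must exist before its entries can be examined.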


[Figure 5 diagram: a health monitor followed by diagnostic
probes and conditions, e.g., check TCP stack (ping 127.0.0.1)
and check local router]

Figure 5: Sample rule for network-related problems.


Category       Typical Problems                 Health Monitors               Diagnostic Probes

Configuration  OS/App upgrades, changes to      Track changes to upgrades,    File privilege, active users,
               various configuration files,     configuration files           file diff with backup
               firewall, spam filtering

Application    App errors due to resource       Check resource usage, error   Processes having most of
               exhaustion (incl. CPU, memory,   log, process status, etc.     the CPU, mem, IO, etc.
               filesystem), app prerequisites,
               path/setup, etc.

Network        Connectivity, performance        Check DNS, firewall,          Traceroute, TCP dump,
                                                routing table, NIC            network options, NIC

Storage        Capacity, data corruption,       Check available space, I/O    Mount options, IO history,
               mount problem, performance,      rate, error logs              big files
               disk swap, etc.

Table 4: Examples of common problem symptoms and corresponding
monitors and probes used to identify these problems.


Our problem determination rules are primarily
derived from inspection of problem tickets and by 
capturing best practices from SAs (i.e., through dis- 
cussions, reviewing their custom scripts and proce- 
dures, etc.). In the case of problem tickets we extract 
rules by examining the steps through which a problem 
was diagnosed and resolved. When the detailed steps 
are available, creating a corresponding rule tree is fair- 
ly straightforward. Some of the tickets, however, do 
not have much detail beyond the original problem de- 
scription and perhaps a few high-level actions taken 
by the SA. In such cases, we must manually infer the 
probes needed to collect the appropriate diagnosis 
data. In our rule extraction process, we use a combina- 
tion of data mining tools to categorize problem tickets 
and identify distinguishing keywords, followed by 
varying degrees of manual inspection to better com- 
prehend the problem and solution description text. The 
data mining tools are not described in detail here. Our 
experience with problem tickets so far leads us to
believe that some amount of manual inspection is
necessary to derive corresponding rules; however, we
continue to investigate automated techniques to assist
the rule derivation process.


System Architecture 


We have implemented a fully functional prototype
of PDA, including the probing and data collection
mechanisms, rule execution, and a Web-based interface.
Figure 6 shows the overall architecture of PDA, which
contains three major components: the probe daemon, the
PDA server, and the user interface. Our probe daemon is
implemented in C for performance and ease of
deployment. The PDA server is implemented in Java, and
our current user interface backend uses IBM WebSphere
Portal Server. We use MySQL for database storage.

[Figure 6: overall architecture of PDA. Surviving diagram
labels: managed servers, probe controller.]


Web User Interface


The Web UI allows SAs to perform various tasks 
from a single interface accessible from any worksta- 
tion. It gives SAs an at-a-glance health overview of all 
of the servers being managed by clearly highlighting 
systems and components that have problems or are 
predicted to have problems in the near future. Whether 
or not an indicator implies a problem is determined by 
the corresponding rule, as discussed further below. In 
addition to showing the current health view of man- 
aged servers, we also allow SAs to look at the status 
of the servers at earlier points in time. This feature is 
useful when a reported problem is not currently evi- 
dent on the system, but may be apparent when view- 
ing system vitals collected earlier. It also is crucial for 
observing trends in certain metrics. Based on our dis- 
cussions with SAs supporting commercial accounts, 
this feature is particularly useful to them in gaining a 
better understanding of the behavior of the managed 
systems. 


Besides monitoring, the Web UI allows SAs to
perform some simple administrative tasks such as
adding another server to be monitored, updating a
user’s access privilege (as a root SA), adding or removing
probes on a managed server, etc. The Web UI also
provides an interface for SAs to author probes and
construct rules from new and existing probes. Rules and
probes constructed using this interface are stored in
XML to facilitate sharing with other SAs or reusing
them across multiple platforms. This feature is
discussed in more detail in Section Rule Sharing.


We are also in the process of implementing a
Web-based secure login mechanism so that SAs can
quickly switch from a shell on one machine to a shell
on another using pre-defined credentials. This allows
SAs to better visualize monitored data using the
graphical interface, while still having access to a
low-level command line interface from a single tool.


Probe Daemon 


The probe daemon is a small program that runs
on managed servers. Its primary functions are to
schedule the execution of probes according to the
frequency set by the SA, and to interface with the PDA
server. The PDA server sends commands to start a new
probe, stop an existing probe, change the periodicity
of a probe, etc. When starting a new probe, the probe
daemon can download the probe from a central probe
repository if the probe does not exist or is not
up-to-date locally. After probes have finished executing,
the probe daemon is also responsible for sending probe
results back to the PDA server.
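
The daemon itself is written in C; the following Python sketch only illustrates the bookkeeping implied by the start, stop, and change-periodicity commands. The class ProbeScheduler, its method names, and the use of an explicit clock value are assumptions made for the example, and downloading probes and returning results are omitted.

```python
class ProbeScheduler:
    """Tracks which periodic probes are due. The current time is passed
    in explicitly so the logic can be exercised without waiting."""

    def __init__(self):
        self._period = {}    # probe name -> seconds between runs
        self._next_due = {}  # probe name -> time of the next run

    def start(self, name, period, now):
        self._period[name] = period
        self._next_due[name] = now + period

    def stop(self, name):
        self._period.pop(name, None)
        self._next_due.pop(name, None)

    def change_period(self, name, period, now):
        if name in self._period:  # ignore probes that are not running
            self.start(name, period, now)

    def due(self, now):
        """Return the probes to run at time `now` and reschedule them."""
        ready = sorted(n for n, t in self._next_due.items() if t <= now)
        for name in ready:
            self._next_due[name] = now + self._period[name]
        return ready


scheduler = ProbeScheduler()
scheduler.start("chk_eth", period=60, now=0)
print(scheduler.due(now=60))
```

A real daemon would call due() from its main loop with the system clock and then execute each returned probe.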


PDA Server and Rule Library 


Most of the information exchange and process- 
ing is handled by the PDA server. Periodically it re- 
ceives probe results from the probe daemon, stores the 
results in a history database, and triggers rule execu- 
tion if there is a corresponding rule for a particular 
probe. The history database stores collected data and 
also serves as a repository for important configuration 
files that are tracked by PDA. 


The rule execution engine is the most important 
part of the PDA server. It parses rules, each defined in a 
separate XML file located in the database, and converts 
them to an in-memory representation for evaluation. 
There are two ways to evaluate a rule. Typically, a rule 
is triggered by a periodic probe which is defined to be 
the root node of the rule tree. The trigger can be a 
threshold violation, change in a key configuration file, 
or other detected problem. Alternatively, an SA
can initiate a rule manually to proactively
collect information related to a specific subsystem.


In both cases, as the rule tree is traversed, a com- 
mand is sent to the probe daemon to execute the re- 
quired probe and return the result. The result is used to 
make a decision to continue collecting more informa- 
tion or stop the rule execution. Some rules also use 
historical data to decide what should be collected next, 
or what should be displayed to the SA. 


Since PDA supports management of multiple 
groups of servers (e.g., for different customers), the 
PDA server also keeps track of which servers are
managed by which SAs. This also implies that each
server or group can have different active rules and 
probes; PDA supports this notion of multi-tenancy in 
the rules library and rule execution engine. 


Rules and Probes 


A probe is usually implemented as a script (e.g., 
Perl, shell, etc.) that either executes native commands 
available in the system or interfaces with other moni- 
toring tools deployed in the environment. The probe 
parses and aggregates the output of the commands, 
and returns the results as an XML document. In order 
to make it easy to add new probes to PDA, the schema 
is a simple and generic key-value pair representation. 
When interfacing with other monitoring tools to col- 
lect data, we write adapters to convert their output to 
the XML format for the PDA server. 
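
As an illustration of the generic key-value format, a probe or adapter could wrap its measurements as shown below. The element and attribute names follow the sample output in Figure 7; the helper name to_results_xml and the use of Python are assumptions, since the paper describes probes as Perl or shell scripts.

```python
import xml.etree.ElementTree as ET


def to_results_xml(probe_name, pairs):
    """Serialize (key, value) pairs in the results/result/key/value layout."""
    root = ET.Element("results", probename=probe_name)
    for key, value in pairs:
        result = ET.SubElement(root, "result")
        ET.SubElement(result, "key").text = str(key)
        ET.SubElement(result, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Values a probe might have parsed from a native command's output.
print(to_results_xml("chk_eth", [("INTERFACE", "eth1"), ("COLLISIONS", 50234)]))
```

Because every probe emits the same flat structure, a new probe needs no schema change on the server side.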


Figure 7 is an example output from a probe that 
monitors Ethernet interfaces. 


Rules are triggered automatically by a monitoring
probe which appears as the first node of the rule
tree. For example, Figure 8 shows a sample rule,
chk_interface, which is triggered by a probe called
chk_eth. In this rule, the first step tests whether the
number of collisions is beyond a certain threshold. If
the threshold is exceeded, the next probe, chk_switch, is
executed to collect some information about the network
switch, for example its firmware version. This type of
scoped probing minimizes monitoring overhead and
expedites the problem determination process. If the
probe in the first node is not present, or the problem is
reported through another channel, such as a problem
ticket, the SA can execute the rule manually.


<results probename="chk_eth">
  <result>
    <key> INTERFACE </key>
    <value> eth1 </value>
  </result>
  <result>
    <key> ERRORS </key>
    <value> 0 </value>
  </result>
  <result>
    <key> DROPPED </key>
    <value> 0 </value>
  </result>
  <result>
    <key> COLLISIONS </key>
    <value> 50234 </value>
  </result>
</results>


Figure 7: A sample probe output. 
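
The first step of the chk_interface rule described above can be sketched by parsing output in the Figure 7 format and comparing the collision count with a threshold. The function names and the threshold of 1000 are assumptions for illustration; the paper does not state the actual threshold or how the rule engine evaluates conditions.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<results probename="chk_eth">
  <result><key> INTERFACE </key><value> eth1 </value></result>
  <result><key> COLLISIONS </key><value> 50234 </value></result>
</results>
"""


def parse_results(xml_text):
    """Return (probe name, dict of key -> value) from a probe result."""
    root = ET.fromstring(xml_text.strip())
    values = {}
    for result in root.findall("result"):
        key = result.findtext("key", default="").strip()
        values[key] = result.findtext("value", default="").strip()
    return root.get("probename"), values


def collisions_exceeded(values, threshold):
    """Condition for the first rule node: too many collisions?"""
    return int(values["COLLISIONS"]) > threshold


name, values = parse_results(SAMPLE)
# Hypothetical threshold; when exceeded, chk_switch would run next.
print(name, collisions_exceeded(values, threshold=1000))
```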


Probe and Rule Authoring 


Given the heterogeneity of enterprise systems 
and applications, it is unrealistic to expect a single 
library of rules and probes to work in all IT
environments. For this reason, PDA is designed to be
extensible to allow authoring of probes and rules,
either from scratch or, more commonly, based on
existing content. We provide templates for writing new
diagnostic probes, and a way to logically group rules
and their associated probes (e.g., based on a particular
target application). Being able to quickly construct a
rule for an observed problem from a set of existing
probes can save SAs precious time.


We have implemented a Web-based probe and 
rule authoring interface. We validate the input of re- 
stricted fields and perform some simple checks (e.g., 
for duplicated names in the repository). Validation of 
the probe code is similarly simple, comprising checks 
that the probe runs successfully and implements the 
necessary rule output formatting. Newly created rules 
require slightly more involved validation. For exam- 
ple, we validate that each node in the rule tree has the 
corresponding probe(s) available, and that the condi- 
tions being checked are supported by the probe. 
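

The rule checks described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PDA implementation: the dictionary layout for rule nodes, the repository mapping, and the field name FIRMWARE_VERSION are all hypothetical.

```python
def validate_rule(nodes, probe_fields):
    """Return a list of problems found in a rule tree.

    nodes: list of dicts with 'id', 'probe', and 'fields'
           (the field names used in the node's condition).
    probe_fields: dict mapping each probe name in the repository
                  to the set of field names that probe reports.
    """
    problems = []
    for node in nodes:
        probe = node["probe"]
        if probe not in probe_fields:
            problems.append(
                f"node {node['id']}: probe '{probe}' is not in the repository"
            )
            continue
        # Every field tested by the condition must be reported by the probe.
        missing = sorted(set(node["fields"]) - probe_fields[probe])
        for field in missing:
            problems.append(
                f"node {node['id']}: probe '{probe}' does not report '{field}'"
            )
    return problems


# Hypothetical repository holding only the chk_eth probe.
repository = {"chk_eth": {"INTERFACE", "ERRORS", "DROPPED", "COLLISIONS"}}
rule = [
    {"id": 0, "probe": "chk_eth", "fields": ["COLLISIONS"]},
    {"id": 1, "probe": "chk_switch", "fields": ["FIRMWARE_VERSION"]},
]
for problem in validate_rule(rule, repository):
    print(problem)
```

Returning a list of messages rather than failing on the first problem lets an authoring interface show every issue at once.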


Once a new probe is authored, its data fields are 
created and stored directly in the corresponding tables 
in the database, and the PDA server is notified of the 
new probe name. If the probe is periodic and needs to 
be activated for monitoring, the probe daemon is con- 
tacted by the PDA server automatically to schedule 
the probe. If the probe is for diagnostics (i.e., a “one-time”
probe rather than a periodic one), it will be downloaded to the
managed server when it is invoked by a rule. Rules are 
stored similarly, along with the XML representation of 
the tree structure. 


Since our probes are mostly scripts which will be
running on managed servers, they pose a potential security
threat to the system if probes are malicious.
Currently we rely on user authentication and out-of-band
change approval for new probe and rule authoring.
It may also be feasible to use compilation techniques
to perform some checks on the semantics of the
scripts, for example to see if a probe is writing to a
restricted part of the filesystem. We are investigating
this in our ongoing work on PDA.


<rule rulename="chk_interface"> 
<node probename="chk_eth" id="0"> 


Rule Sharing 


In our discussions with SAs, we found that shar- 
ing knowledge and experience between them is a con- 
siderable challenge. One potentially significant benefit 
of rule and probe authoring in PDA is the opportunity 
to share them with other SAs managing the same envi- 
ronment, or even those working in very different envi- 
ronments. 


Guaranteeing that rules and probes authored by 
one SA are applicable to problem resolution on other 
systems poses a number of difficulties. The primary 
one is the wide variety of platforms, operating sys- 
tems, and software. Most rules are largely platform-in- 
dependent as the information can be extracted on most 
OSes. However, the probes that actually collect the in- 
formation can be quite different on various platforms. 
Even on machines with the same OS, different patch 
levels or software configurations can very easily break 
probes. To make sharing of rules and probes more 
seamless to SAs, we annotate them with dependency 
information that indicates the platform and version on 
which they have been deployed or tested. 
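

One possible shape for such dependency annotations is sketched below in Python. The paper does not specify the annotation format, so the Dependency class, its fields, and the version ranges shown are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """Hypothetical annotation: a platform and the version range tested."""
    platform: str       # e.g. "AIX"
    min_version: tuple  # lowest version tested, e.g. (5, 2)
    max_version: tuple  # highest version tested, e.g. (5, 3)


def is_tested_on(deps, platform, version):
    """True if any annotation covers the given platform and version."""
    return any(
        d.platform == platform and d.min_version <= version <= d.max_version
        for d in deps
    )


# Illustrative annotation for a service-checking probe.
chk_service_deps = [Dependency("AIX", (5, 2), (5, 3))]
print(is_tested_on(chk_service_deps, "AIX", (5, 3)))
print(is_tested_on(chk_service_deps, "Linux", (2, 6)))
```

Versions are compared as tuples so that, for example, (5, 10) sorts after (5, 9), which a plain string comparison would get wrong.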


Initially, we expect to deploy our tools in a fairly 
homogeneous environment, e.g., with mostly UNIX 
machines. We expect most dependency issues in such 
an environment to be solved relatively easily, for ex- 
ample by using a different binary/utility to obtain the 
same information. This technique can be carried over 
to managing other flavors of UNIX or Linux. As more 
users contribute rule and probe content over time, they 
will likely cover a more comprehensive set of plat- 
forms and provide a valuable way to accumulate and 
codify system management knowledge. 


PDA Usage Model 


The design of PDA is largely motivated by our
experience in IT service provider environments, in
which globally distributed support teams manage the
infrastructure belonging to large enterprises. In these
environments, creating, communicating, and adhering
to best practices for systems architecture and management
is a significant challenge. Global support teams
often consist of administrators with greatly varying
amounts of experience and knowledge, so providing the
ability to capture and operationalize problem determination
procedures for the whole team is very valuable.
In addition to improving efficiency through automated
collection of relevant information, it also allows SAs
to follow similar procedures which could be designed
by the most experienced team members. At the same
time, PDA allows customization of problem determination
rules to account for different customer environments
or priorities. A PDA installation could include a
standard set of problem determination rules and associated
probes that handle common or general problems
(e.g., networking problems, excessive resource consumption,
etc.). These could be supplemented with
new content that is available from a central repository,
or through local modifications of existing content.


<condition> COLLISIONS > 500 </condition> 
<true-branch> id="1" </true-branch> 
</node> 
<node probename="chk_switch" id="1"> 
<condition> MANUFACTURER == LINKSYS && 
MODEL == ETHERFAST && 
FIRMWARE VERSION <= 2.3.1 
</condition> 
<true-branch> 
alert("upgrade firmware") 
</true-branch> 
<false-branch> id="2" </false-branch> 
</node> 
<node ...> 
</rule> 


Figure 8: A sample rule file. 


This usage model is particularly applicable for IT 
service providers who manage many customer infra- 
structures in a number of industries, since it is likely 
that there is a lot of similarity at the system software 
level. Furthermore, when the service provider has per- 
formed some degree of transformation in the customer 
environment, for example to move to preferred plat- 
forms and tools, the possibility for sharing and reuse 
increases. In IT environments belonging to universi- 
ties and industry research labs, we observe more het- 
erogeneity in OS platforms, applications, and usage, 
which makes it more difficult to develop problem de- 
termination best practices that are widely applicable. 
Nevertheless, the automation and extensibility features 
of PDA are still useful in these situations. 


Problem Determination Experiences With PDA 


This section describes some of our experiences 
in constructing rules and probes to diagnose practical 
problems with PDA. The rules we built based on prob- 
lem tickets and best practices from interviewing SAs 
are not comprehensive, but they address commonly 
occurring problems in realistic settings and can be ex- 
panded and enhanced by communities of users or ad- 
ministrators. 


Experience with NFS Problems 


NFS allows files to be accessed across a network 
with high performance, and its relatively easy configu- 
ration process has made it very popular in large and 
small computing environments alike. However, when 
problems occur, finding the root cause can take a significant
amount of time due to NFS’s many dependencies.
Most of the NFS-related problem tickets we observed
are straightforward to solve, but some have
symptoms that are difficult to connect with their final
solution. This often results in tickets being forwarded
multiple times to different support groups, sometimes
incorrectly, before they are finally resolved. We found 
that most of these problems can be diagnosed with a 
few simple systematic steps. 


$raw = `lssrc -a`;
@lines = split("\n", $raw);

## Omitted code for error-checking
## executing lssrc

## Parses lssrc output: 3 possible
## output formats
## 1. Subsystem Status
## 2. Subsystem Group Status
## 3. Subsystem Group PID Status
foreach $line (@lines) {
    $line =~ s/^\s+|\s+$//g;
    @lineElem = split(/\s+/, $line);
    $numLineElem = scalar(@lineElem);
    $match = $lineElem[0] eq $ARGV[0];
    if ($numLineElem == 4) {
        if ($match &&
            $lineElem[3] eq "active") {
            $foundActive = 1;
        }
    } elsif ($numLineElem == 3) {
        if ($match &&
            $lineElem[2] eq "inoperative") {
            $foundButNotActive = 1;
        }
    } elsif ($numLineElem == 2) {
        if ($match &&
            $lineElem[1] eq "inoperative") {
            $foundButNotActive = 1;
        }
    }
}
# Format output and dump to stdout
if (defined($foundActive)) {
    &PDAFormatOutput(...);
} elsif (defined($foundButNotActive)) {
    &PDAFormatOutput(...);
} else {
    &PDAFormatOutput(...);
}


Figure 9: A simple probe that checks if a system ser- 
vice is currently running. 


The rule to determine NFS-related problems is 
shown in Figure 10 as a tree. The rule tree is traversed
using information gathered by dispatching a series of
probes. Some rules are intuitive: check for liveness
of all NFS service daemons, e.g., nfsd, mountd, statd, 
and lockd, and check if these services are properly reg- 
istered with the portmapper. In Figure 9, we show an 
example probe that checks if a system service has 
been started and is currently running. As each probe
does something very specific, it can be quickly and
easily implemented and maintained. We used Perl to implement
this probe, but probes can be written in any
language as long as their output format matches the 
pre-defined format. There are also other more esoteric
rules: statd should always start before mountd, or the
exname option can only be used if the nfsroot option is
specified in /etc/exports. However, we have seen that a
majority of problem tickets can be addressed by the 
simpler checks, e.g., looking for a missing /etc/exports 
file, non-existent mount points, etc. Some probes are 
useful to exercise periodically to help SAs maintain a 
healthy NFS service (e.g., check for hung daemons or 
abnormal RPC activities). Some can be run when 
changes are made to configuration files to check for 
potential problems caused by a recent change. Some 
can even be triggered manually in response to client- 
reported problems. 
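

Since probes may be written in any language, the same service check can also be sketched in Python. This is a simplified variant of the Figure 9 logic, not a translation of it: it classifies a service by the last column of the listing rather than by column count. The function names, the sample listing (including its PID), and the probe name chk_service are illustrative assumptions; only the output layout follows Figure 7.

```python
from xml.sax.saxutils import escape, quoteattr


def service_status(listing, name):
    """Classify a service from 'lssrc -a'-style text.

    Returns 'active', 'inoperative', or 'unknown'.
    """
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0] == name and fields[-1] in ("active", "inoperative"):
            return fields[-1]
    return "unknown"


def format_output(probename, pairs):
    """Render key/value pairs in the <results> layout of Figure 7."""
    lines = [f"<results probename={quoteattr(probename)}>"]
    for key, value in pairs:
        lines.append("<result>")
        lines.append(f"<key> {escape(key)} </key>")
        lines.append(f"<value> {escape(str(value))} </value>")
        lines.append("</result>")
    lines.append("</results>")
    return "\n".join(lines)


# Made-up listing used in place of running lssrc on a real system.
SAMPLE = """Subsystem         Group            PID          Status
 portmap          portmap          4388         active
 nfsd             nfs                           inoperative
"""

status = service_status(SAMPLE, "nfsd")
print(format_output("chk_service", [("SERVICE", "nfsd"), ("STATUS", status)]))
```

Keeping the parsing and the output formatting in separate functions means only the first would need to change for a platform with a different service-listing command.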


The benefits of PDA’s automated problem deter- 
mination capability are best illustrated by considering 
a specific problem scenario. Consider the problem of a 
misconfigured system that starts nfsd before portmapper.
The misconfigured system will not allow users to
access remote NFS partitions, and the resultant problem
ticket has a correspondingly vague problem description.
In the absence of PDA, the SA must manually
narrow down the problem cause, starting, for example,
by logging into the affected systems, checking
network connectivity, verifying firewall rules, etc. to
make sure the NFS problem is not a side effect of another
problem in the system. Having done that, the SA
may then check /etc/exports for possible permission
problems, see if all the defined mount points exist, and
run the ps command to confirm that all NFS-related
services are running. He or she might need to run
more diagnostic tools before finally discovering that
nfsd was not correctly registered with the portmapper.
Using PDA, however, this misconfiguration problem
can be discovered quickly by examining the results of
the automated rule execution.
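

The registration check in this scenario can be sketched as follows, assuming the probe has captured the text of an "rpcinfo -p" listing (program, version, protocol, port, service name). The function names, the sample listing, and the list of required services are assumptions for illustration; the paper does not show the actual probe.

```python
def registered_services(rpcinfo_text):
    """Collect service names from 'rpcinfo -p'-style output."""
    services = set()
    for line in rpcinfo_text.splitlines():
        fields = line.split()
        # Data rows have five columns and start with a numeric program number;
        # this skips the header line.
        if len(fields) == 5 and fields[0].isdigit():
            services.add(fields[4])
    return services


def missing_services(rpcinfo_text, required):
    """Return the required services absent from the portmapper listing."""
    return sorted(set(required) - registered_services(rpcinfo_text))


# Made-up listing for a host where nfsd started before the portmapper
# and therefore never registered.
SAMPLE = """   program vers proto   port  service
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100005    1   udp    635  mountd
"""

print(missing_services(SAMPLE, ["nfs", "mountd"]))
```

A rule node could branch on whether the returned list is empty, raising an alert that names the unregistered service.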


Experience With Storage Problems 


We were surprised that a large percentage (90%) 
of the storage-related problem tickets we observed 
were related to simple capacity issues, e.g., disk parti- 
tions used for storing temporary files or logs being 
close to or completely full, thus hampering normal op- 
erations in the system. Given the large volume of such 
tickets, early detection and resolution of storage ca- 
pacity problems can potentially eliminate a large num- 
ber of tickets. In many environments, management 
tools monitor filesystems and raise an alert if the 
utilization crosses a specified threshold. This approach
has the disadvantage of potential false alarms if, for example,
a filesystem is normally highly utilized (e.g.,
an application scratch area). In PDA, we use a low-overhead
profiler that periodically samples disk usage
for each partition being monitored and raises an alert
if a partition’s usage (within a time interval N) shows
an upward trend under which the partition is projected to be
completely filled within the next 24 hours. A 24-hour
period is chosen to allow an SA enough time to con- 
firm a problem and take action. An example illustrat- 
ing how the profiler is able to find storage capacity 
problems is shown in Figure 11. In this example, sim- 
ple linear regression is used for trending, which will 
detect that in interval 3 this partition will be full with- 
in a day and raise an alert. More complex regression 
methods can be used to further suppress false alarms. 
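

The trending step can be illustrated with a short Python sketch that fits a least-squares line to the samples in one interval and projects when usage reaches 100%. The function names, the percent-based units, and the sample values are assumptions; the paper states only that simple linear regression is used and that the alert horizon is 24 hours.

```python
def hours_until_full(samples, capacity=100.0):
    """Project when usage reaches capacity using a least-squares line.

    samples: list of (hour, percent_used) pairs within the interval N.
    Returns hours from the last sample until full, or None if usage
    is not rising (or there are too few samples to fit a line).
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def should_alert(samples, horizon_hours=24.0):
    """Alert only if the fitted trend fills the partition within the horizon."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= horizon_hours


rising = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]  # +2% per hour
scratch = [(0, 95.0), (1, 95.0), (2, 95.0)]            # high but steady
print(should_alert(rising), should_alert(scratch))
```

The second sample set shows why trending suppresses the false alarm a fixed threshold would raise: the scratch partition is 95% full, but its usage is flat, so no alert is generated.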


As an example similar to what we described in
Section 2.4, consider an actual problem ticket with
this description: Ops received the following alert on
host: <hostname> “Percent space used (/var/log)
greater than 95% - currently 100%.” This problem
ticket was opened by the operations team after an alert
was raised by a monitoring tool. Even though the
problem may be easy to solve, by the time the SA receives
this ticket, the system may already have been experiencing
problems for an extended time, disrupting its normal
operations. Using the profiler, such problems can be
accurately predicted and notifications can be provided
to SAs earlier.


Experience with Application Problems 


Application problems are the most frequently en- 
countered category of problems (as shown in Figure 
4), and also the most varied due to the myriad applica- 
tions running in enterprise environments. Some appli- 
cation problems can be detected with periodic probes 
similar to the way we diagnose NFS-related problems. 
These can check, for example, for the liveness of ap- 
plication processes, whether specific application ports 
are open, or for the presence of certain application 
files. The extensibility of PDA can be leveraged to 
handle other types of application problems, such as 
newly discovered security vulnerabilities. A commu- 
nity of users can devise and distribute new probes and 
associated rules to detect new vulnerabilities, which 
can shorten the system exposure time and save SAs 
time in evaluating whether their systems are affected. 


One application for which we observed a signifi- 
cant number of problem tickets was the widely de- 
ployed mail server application, Sendmail. Traditional- 
ly, Sendmail configuration has been considered very 
complex and we expected the problems to be related 
primarily to setup issues or misconfiguration. Howev- 
er, again, most of the problems were actually caused by 
relatively simple issues — a congested mail queue, a 
dead Sendmail process, a newly discovered security 
vulnerability, etc. An example problem ticket related to 
Sendmail had the following description: Ops received 
the following alerts for <hostname> advising: “Total 
mail queue is greater than 15000 — currently 56345. 
Sendmail server <hostname> is not responding to 
SMTP connection on port 25.”. Using a periodic live- 
ness probe and a profiler (similar to the one for filesys- 
tems described above), the SA can be notified of a 
pending problem well before a critical level is reached. 
In the current implementation, PDA incorporates rules 
to check some Sendmail-related parameters, including 
process liveness and mail queue length. 
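

A liveness probe of the kind mentioned above can be as small as a TCP connection attempt. The Python sketch below is an assumption about how such a probe might look, not PDA's implementation; the function name port_alive and the timeout default are hypothetical.

```python
import socket


def port_alive(host, port=25, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A periodic probe could call port_alive with the mail server's hostname and report the result in the standard probe output format. A successful connection shows only that something is listening on port 25; checking the SMTP greeting would be a stricter test.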


Related Work 


In certain cases, problems can be solved
by fixing the underlying systems. However, a large
number of the problems addressed by SAs are not system
bugs; instead, they are configuration problems, compatibility
issues, usage mistakes, etc. Much of the prior
work on system problem determination has centered
on system monitoring. While monitoring is
clearly important for PDA, it is not a primary focus of
our work. We designed PDA to be able to use data collected
from any monitoring tool. Our analysis of the
ticket data helps us choose which parts of the system
should be monitored, but not how to implement monitoring
probes. Our focus is instead on creating rules
that help in diagnosing a problem once the monitoring
system detects an issue.


Recently, researchers have been building problem 
monitoring and analysis tools using system events and 
system logs. Redstone, et al. [8] propose a vision of an
automated problem diagnosis system that observes
symptoms and matches them against a problem database.
Works such as PeerPressure [11], Strider [12], and others
[6, 7] address misconfiguration problems in Windows
systems by building and identifying signatures of normal
and abnormal Windows Registry entries. Magpie [2], 
FDR [10], and others [13, 3] improve system manage- 
ment by using fine-grained system event-tracing mecha- 
nisms and analysis. We have similar goals to these 
works but our approach is to bring more expert knowl- 
edge in diagnosing problems and allow for a dynamical- 
ly changing set of collected events. 


While most problem determination systems have 
a fixed set of events to collect, few use online mea- 
surements as a basis for taking further actions. Rish, et 
al. [9] propose an approach called active probing for 
network problem determination, and Banga [1] describes
a similar system built for Network Appliance
storage devices. Our system bears some resemblance
to these approaches but is more generalized
and has a tighter link with observed problems, for example
through IT problem tracking systems. In making
decisions about what additional information needs to 
be probed, we use expert knowledge and indicators 
collected from real tickets as opposed to using proba- 
bilistic inference. 


Our rules are conceptually similar to procedures,
but they are fundamentally different: a rule
combines knowledge with an execution environment. The
rule execution has the intelligence to identify the
next step and to execute the right diagnostic probes.
Moreover, it is important to capture the system state
just as the problem occurs. In current practice, SAs
often need to reproduce a problem, but the environment
may have changed. Performing diagnosis
right after the problem occurs therefore not only saves
time, but may also be the only way to catch the root cause.


Discussion and Ongoing Work 


In our design of PDA, we were able to use prob- 
lem ticket information as a guideline for the design of 
problem determination rules and associated probes. 
However, deriving rules from tickets still involves 
manual effort. We are investigating approaches to 
make this process more automated, but the varying 
quality of the free-text descriptions will be a continu- 
ing challenge. We are also investigating better models 
for the structured data which can more precisely cap- 
ture problem signatures. 


We use structured fields such as ticket type and
cause code to categorize problem tickets and pick the
common problems, as described in the section on common
OS problems. Qualitatively speaking, our rules are
able to address these common problems, but a more
quantitative notion of “problem coverage” is required.
One approach we plan to pursue to this end is
to extract distinguishing attributes of a small set of
tickets in the categories that are covered by our rules.
The attributes can include a combination of structured
fields as well as keywords from the unstructured text.
Using these distinguishing attributes, we can examine
a much larger set of tickets and look for matching
tickets automatically to calculate the fraction of tickets
that we expect to be similarly covered by the problem
determination rules.
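

The coverage calculation outlined above might look like the following Python sketch. The ticket and signature layouts, the field names, and the sample tickets are hypothetical; the paper describes the approach only as planned work and gives no data format.

```python
def matches(ticket, signature):
    """A ticket matches if all structured fields agree and every
    keyword appears in its free-text description."""
    fields_ok = all(ticket.get(k) == v for k, v in signature["fields"].items())
    text = ticket.get("description", "").lower()
    keywords_ok = all(word.lower() in text for word in signature["keywords"])
    return fields_ok and keywords_ok


def coverage(tickets, signatures):
    """Fraction of tickets matched by at least one signature."""
    if not tickets:
        return 0.0
    covered = sum(1 for t in tickets if any(matches(t, s) for s in signatures))
    return covered / len(tickets)


# Illustrative signatures and tickets (not taken from real ticket data).
signatures = [
    {"fields": {"type": "storage"}, "keywords": ["space used"]},
    {"fields": {"type": "application"}, "keywords": ["mail queue"]},
]
tickets = [
    {"type": "storage", "description": "Percent space used (/var/log) greater than 95%"},
    {"type": "application", "description": "Total mail queue is greater than 15000"},
    {"type": "network", "description": "Intermittent packet loss on eth1"},
    {"type": "storage", "description": "Disk controller reports failed drive"},
]
print(coverage(tickets, signatures))
```

Requiring both the structured fields and the keywords to match keeps the estimate conservative when free-text descriptions are of varying quality.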


Although our experience with PDA shows prom- 
ise that it can reduce time or effort for diagnosing 
problems, a more comprehensive study is needed to 
gain a sense of how much time or effort can be saved. 
We are making PDA available to system administra- 
tors who support a variety of customer environments 
in order to collect additional experiences, and ideally 
some quantitative data on the savings. We also expect 
to further validate and expand our problem determination
rules through usage by SAs. Since there is no
effective way to document problem resolution experience,
our rules and authoring mechanism are good candidates
for such a purpose.


Summary 


This paper describes the Problem Determination 
Advisor, a tool for automating the problem determina- 
tion process. Based on a study of problem tickets from 
a large enterprise IT support organization, we identi- 
fied commonly occurring server problems and devel- 
oped a set of problem determination rules to aid in 
their diagnosis. We implemented these rules in the PDA 
tool using a two-level approach in which high-level 
system health monitors trigger lower-level diagnostic 
probes to collect relevant details when a problem is 
detected. We demonstrated the effectiveness of PDA 
in problem diagnosis using a number of actual prob- 
lem scenarios. 


ABSTRACT 


Usher is a virtual machine management system designed to impose few constraints upon the 
computing environment under its management. Usher enables administrators to choose how their 
virtual machine environment will be configured and the policies under which they will be managed. 
The modular design of Usher allows for alternate implementations for authentication, authorization, 
infrastructure handling, logging, and virtual machine scheduling. The design philosophy of Usher is 
to provide an interface whereby users and administrators can request virtual machine operations 
while delegating administrative tasks for these operations to modular plugins. Usher’s implementa- 
tion allows for arbitrary action to be taken for nearly any event in the system. Since July 2006, Usher 
has been used to manage virtual clusters at two locations under very different settings, demonstrating 
the flexibility of Usher to meet different virtual machine management requirements. 


Introduction 


Usher is a cluster management system designed 
to substantially reduce the administrative burden of 
managing cluster resources while simultaneously im- 
proving the ability of users to request, control, and 
customize their resources and computing environment. 
System administrators of cluster computing environ- 
ments face a number of imposing challenges. Different 
users within a cluster can have a wide range of com- 
puting demands, spanning general best-effort comput- 
ing needs, batch scheduling systems, and complete 
control of dedicated resources. These resource de- 
mands vary substantially over time in response to 
changes in workload, user base, and failures. Further- 
more, users often need to customize their operating 
system and application environments, substantially in- 
creasing configuration and maintenance tasks. Finally, 
clusters rarely operate in isolated administrative envi- 
ronments, and must be integrated into existing authen- 
tication, storage, network, and host address and name 
service infrastructure. 


Usher balances these imposing requirements us- 
ing a combination of abstraction and architecture. 
Usher provides a simple abstraction of a logical clus- 
ter of virtual machines, or virtual cluster. Usher users 
can create any number of virtual clusters (VCs) of ar- 
bitrary size, while Usher multiplexes individual virtual 
machines (VMs) on available physical machine hard- 
ware. By decoupling logical machine resources from 
physical machines, users can create and use machines 
according to their needs rather than according to as- 
signed physical resources. 


Architecturally, Usher is designed to impose few 
constraints upon the computing environment under its 
management. No two sites have identical hardware 
and software configurations, user and application
requirements, or service infrastructures. To facilitate
its use in a wide range of environments, Usher com- 
bines a core set of interfaces that implement basic 
mechanisms, clients for using these mechanisms, and 
a framework for expressing and customizing adminis- 
trative policies in extensible modules, or plugins. 


The Usher core implements basic virtual cluster 
and machine management mechanisms, such as creat- 
ing, destroying, and migrating VMs. Usher clients use 
this core to manipulate virtual clusters. These clients 
serve as interfaces to the system for users as well as 
for use by higher-level cluster software. For example, 
an Usher client called ush provides an interactive 
command shell for users to interact with the system. 
We have also implemented an adapter that lets a high-level
execution management system [6] operate as
an Usher client, creating and manipulating virtual
clusters on its own behalf.


Usher supports customizable modules for two 
important purposes. First, these modules enable Usher 
to interact with broader site infrastructure, such as au- 
thentication, storage, and host address and naming ser- 
vices. Usher implements default behavior for common 
situations, e.g., newly created VMs in Usher can use a 
site’s DHCP service to obtain addresses and domain 
names. Additionally, sites can customize Usher to im- 
plement more specialized policies; at UCSD, an Usher 
VM identity module allocates IP address ranges to 
VMs within the same virtual cluster. 


Second, pluggable modules enable system admin- 
istrators to express site-specific policies for the place- 
ment, scheduling, and use of VMs. As a result, Usher al- 
lows administrators to decide how to configure their vir- 
tual machine environments and determine the appropri- 
ate management policies. For instance, to support a gen- 
eral-purpose computing environment, administrators can 
install an available Usher scheduling and placement 
plugin that performs round-robin placement of VMs 
across physical machines and simple rebalancing in re- 
sponse to the addition or removal of virtual and physical 
machines. With this plugin, users can dynamically add 
or remove VMs from VCs at any time without having to 
specify service level agreements (SLAs) [9, 17, 22], 
write configuration files [10], or obtain leases on re- 
sources [12, 14]. With live migration of VMs, Usher can 
dynamically and transparently adjust the mapping of 
virtual to physical machines to adapt to changes in load 
among active VMs or the working set of active VMs, 
exploit affinities among VMs (e.g., to enhance physical 
page sharing [20]), or add and remove hardware with 
little or no interruption. 
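

As an illustration of the round-robin policy described above, the following Python sketch assigns VMs to physical machines in turn. It is a stand-alone example under stated assumptions: Usher's plugin interface is not described in this section, so the function name and the use of plain strings for VM and host names are hypothetical.

```python
from itertools import cycle


def round_robin_placement(vms, hosts):
    """Assign each VM to a host in turn; returns {vm: host}."""
    if not hosts:
        raise ValueError("no physical machines available")
    host_cycle = cycle(hosts)
    return {vm: next(host_cycle) for vm in vms}


placement = round_robin_placement(
    ["vm1", "vm2", "vm3", "vm4", "vm5"], ["hostA", "hostB"]
)
print(placement)
```

Rebalancing after machines are added or removed could be approximated by recomputing the placement and migrating only the VMs whose assigned host changed; that step is omitted here.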


Usher enables other powerful policies to be ex- 
pressed, such as power management (reduce the num- 
ber of active physical machines hosting virtual clus- 
ters), distribution (constrain virtual machines within a 
virtual cluster to run on separate nodes), and resource 
guarantees. Another installation of Usher uses its clus- 
ter to support scientific batch jobs running within vir- 
tual clusters, guarantees resources to those jobs when 
they run, and implements a load-balancing policy that 
migrates VMs in response to load spikes [13]. 
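

The distribution policy mentioned above (VMs of one virtual cluster running on separate nodes) can be checked with a few lines of Python. The data layout and function name are assumptions for illustration and do not reflect Usher's actual policy interface.

```python
from collections import Counter


def distribution_violations(placement, clusters):
    """Return {cluster: [hosts running more than one of its VMs]}.

    placement: dict mapping VM name to the host it runs on.
    clusters: dict mapping virtual cluster name to its list of VMs.
    """
    violations = {}
    for cluster, vms in clusters.items():
        counts = Counter(placement[vm] for vm in vms if vm in placement)
        crowded = sorted(host for host, count in counts.items() if count > 1)
        if crowded:
            violations[cluster] = crowded
    return violations


vm_hosts = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB"}
virtual_clusters = {"vc1": ["vm1", "vm2"], "vc2": ["vm3"]}
print(distribution_violations(vm_hosts, virtual_clusters))
```

A policy module could run such a check after each placement decision and trigger a migration for any cluster that appears in the result.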


Usher is a fully functional system. It has been in- 
stalled in cluster computing environments at UCSD 
and the Russian Research Center in Kurchatov, Rus- 
sia. At UCSD, Usher has been in production use since 
January 2007. It has managed up to 142 virtual ma- 
chines in 26 virtual clusters across 25 physical ma- 
chines. The Usher implementation is sufficiently reli- 
able that we are now migrating the remainder of our 
user base from dedicated physical machines to virtual 
clusters, and Usher will soon manage all 130 physical 
nodes in our cluster. In the rest of this paper we de- 
scribe the design and implementation of Usher, as well 
as our experiences using it. 


Related Work 


Since the emergence of widespread cluster com- 
puting over a decade ago [8, 16], many cluster config- 
uration and management systems have been developed 
to achieve a range of goals. These goals naturally in- 
fluence individual approaches to cluster management. 
Early configuration and management systems, such as 
Galaxy [19], focus on expressive and scalable mecha- 
nisms for defining clusters for specific types of ser- 
vice, and physically partition cluster nodes among 
those types. 


More recent systems target specific domains, 
such as Internet services, computational grids, and ex- 
perimental testbeds, that have strict workload or re- 
source allocation requirements. These systems support 
services that express explicit resource requirements, 
typically in some form of service level agreement 
(SLA). Services provide their requirements as input to
the system, and the system allocates its resources
among services while satisfying the constraints of the
SLA requirements.


McNett, et al. 


For example, Océano provides a computing utility 
for e-commerce [9]. Services formally state their work- 
load performance requirements (e.g., response time), 
and Océano dynamically allocates physical servers in 
response to changing workload conditions to satisfy 
such requirements. Rocks and Rolls provide scalable 
and customizable configuration for computational grids 
[11, 18], and Cluster-on-Demand (COD) performs re- 
source allocation for computing utilities and computa- 
tional grid services [12]. COD implements a virtual 
cluster abstraction, where a virtual cluster is a disjoint 
set of physical servers specifically configured to the re- 
quirements of a particular service, such as a local site 
component of a larger wide-area computational grid. 
Services specify and request resources to a site manager 
and COD leases those resources to them. Finally, Emu- 
lab provides a shared network testbed in which users 
specify experiments [21]. An experiment specifies net- 
work topologies and characteristics as well as node 
software configurations, and Emulab dedicates, iso- 
lates, and configures testbed resources for the duration 
of the experiment. 


The recent rise in virtual machine monitor (VMM) 
popularity has naturally led to systems for configuring 
and managing virtual machines. For computational 
grid systems, for example, Shirako extends Cluster-on-Demand
by incorporating virtual machines to further
improve system resource multiplexing while satisfying
explicit service requirements [14], and VIOLIN
supports both intra- and inter-domain migration to
satisfy specified resource utilization limits [17]. Sand- 
piper develops policies for detecting and reacting to 
hotspots in virtual cluster systems while satisfying ap- 
plication SLAs [22], including determining when and 
where to migrate virtual machines, although again un- 
der the constraints of meeting the stringent SLA re- 
quirements of a data center. 


On the other hand, Usher provides a framework 
that allows system administrators to express site-spe- 
cific policies depending upon their needs and goals. 
By default, the Usher core provides, in essence, a gen- 
eral-purpose, best-effort computing environment. It 
imposes no restrictions on the number and kind of vir- 
tual clusters and machines, and performs simple load 
balancing across physical machines. We believe this 
usage model is important because it is widely applica- 
ble and natural to use. Requiring users to explicitly 
specify their resource requirements for their needs, for 
example, can be awkward and challenging since users 
often do not know when or for how long they will 
need resources. Further, allocating and reserving re- 
sources can limit resource utilization; guaranteed re- 
sources that go idle cannot be used for other purposes. 
However, sites can specify more elaborate policies in 
Usher for controlling the placement, scheduling, and 
migration of VMs if desired. Such policies can range
from batch schedulers to allocation of dedicated physical
resources.


21st Large Installation System Administration Conference (LISA ’07) 


In terms of configuration, Usher shares many of 
the motivations that inspired the Manage Large Net- 
works (MLN) tool [10]. The goal of MLN is to enable 
administrators and users to take advantage of virtual- 
ization while easing administrator burden. Administra- 
tors can use MLN to configure and manage virtual 
machines and clusters (distributed projects), and it 
supports multiple virtualization platforms (Xen and 
User-Mode Linux). MLN, however, requires adminis- 
trators to express a number of static configuration de- 
cisions through configuration files (e.g., physical host 
binding, number of virtual hosts), and supports only 
coarse granularity dynamic reallocation (manually by 
the administrator). Usher configuration is interactive 
and dynamic, enables users to create and manage their 
virtual clusters without administrative intervention, 
and enables a site to globally manage all VMs accord- 
ing to cluster-wide policies. 


XenEnterprise [5] from XenSource and Virtual- 
Center [4] from VMware are commercial products for 
managing virtual machines on cluster hardware from 
the respective companies. XenEnterprise provides a 
graphical administration console, Virtual Data Center, 
for creating, managing, and monitoring Xen virtual ma- 
chines. VirtualCenter monitors and manages VMware 
virtual machines on a cluster as a data center, support- 
ing VM restart when nodes fail and dynamic load bal- 
ancing through live VM migration. Both list interfaces 
for external control, although it is not clear whether ad- 
ministrators can implement arbitrary plugins and poli- 
cies for integrating the systems into existing infrastruc- 
ture, or controlling VMs in response to arbitrary events 
in the system. In this regard, VMware’s Infrastructure
Management SDK provides functionality similar to that
provided by the Usher client API. However, this SDK
does not provide the tight integration with VMware’s
centralized management system that plugins do for the 
Usher system. Also, of course, these are all tied to 
managing a single VM product, whereas Usher is de- 
signed to interface with any virtualization platform 
that exports a standard administrative interface. 


System Architecture 


This section describes the architecture of Usher. 
We start by briefly summarizing the goals guiding our 
design, and then present a high-level overview of the 
system. We then describe the purpose and operation of 
each of the various system components, and how they 
interact with each other to accomplish their tasks. We 
end with a discussion of how the Usher system accom- 
modates software and hardware failures. 


Design Goals 


As mentioned, no two sites have identical hard- 
ware and software configurations, user and application 
requirements, or service infrastructures. As a result, 
we designed Usher as a flexible platform for
constructing virtual machine management installations
customized to the needs of a particular site. 


To accomplish this goal, we had two design objec- 
tives for Usher. First, Usher maintains a clean separation 
between policy and mechanism. The Usher core pro- 
vides a minimal set of mechanisms essential for virtual 
machine management. For instance, the Usher core has 
mechanisms for placing and migrating virtual machines, 
while administrators can install site-specific policy mod- 
ules that govern where and when VMs are placed. 
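

To make this separation concrete, the following Python sketch shows a core that only knows how to record a placement, and a swappable policy function that decides where a VM goes. All names here (`Core`, `least_loaded`, `place`) are hypothetical and are not taken from the Usher source; the real plugin interface may differ.

```python
from typing import Callable, Dict

# A policy maps (vm name, current VM count per node) to a chosen node.
# Hypothetical signature; Usher's real policy modules may look different.
PlacementPolicy = Callable[[str, Dict[str, int]], str]


def least_loaded(vm_name: str, load: Dict[str, int]) -> str:
    """Example policy: pick the node currently running the fewest VMs."""
    return min(sorted(load), key=lambda node: load[node])


class Core:
    """Mechanism only: records where VMs run; the policy decides placement."""

    def __init__(self, nodes, policy: PlacementPolicy):
        self.load = {node: 0 for node in nodes}
        self.placement = {}
        self.policy = policy

    def place(self, vm_name: str) -> str:
        node = self.policy(vm_name, self.load)  # policy: where to run
        self.placement[vm_name] = node          # mechanism: record it
        self.load[node] += 1
        return node
```

Replacing `core.policy` at run time changes where new VMs land without touching the placement mechanism itself.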


Second, Usher is designed for extensibility. The 
Usher core provides three ways to extend functionali- 
ty, as illustrated in Figure 1. First, Usher provides a set 
of hooks to integrate with existing infrastructure. For 
instance, while Usher provides a reference implemen- 
tation for use with the Xen VMM, it is straightforward 
to write stubs for other virtualization platforms. Sec- 
ond, developers can use a Plugin API to enhance Ush- 
er functionality. For example, plugins can provide 
database functionality for persistently storing system 
state using a file-backed database, or provide authenti- 
cation backed by local UNIX passwords. Third, Usher 
provides a Client API for integrating with user inter- 
faces and third-party tools, such as the Usher com- 
mand-line shell and the Plush execution management 
system (discussed in the Applications subsection of 
the Implementation section). 
Figure 1: Usher interfaces.


Usher Overview


A running Usher system consists of three main 
components: local node managers (LNMs), a central- 
ized controller, and clients. A client consists of an ap- 
plication that utilizes the Usher client library to send 
virtual machine management requests to the controller. 
We have written a few applications that import the 
Usher client library for managing virtual machines (a 
shell and XML-RPC server) with more under develop- 
ment (a web frontend and command line suite). 


Figure 2 depicts the core components of an Ush- 
er installation. One LNM runs on each physical node 
and interacts directly with the VMM to perform man- 
agement operations such as creating, deleting, and mi- 
grating VMs on behalf of the controller. The local 
node managers also collect resource usage data from 
the VMMs and monitor local events. LNMs report re- 
source usage updates and events back to the controller 
for use by plugins and clients.


The controller is the central component of the
Usher system. It receives authenticated requests from 
clients and issues authorized commands to the LNMs. 
It also communicates with the LNMs to collect usage 
data and manage virtual machines running on each 
physical node. The controller provides event notifica- 
tion to clients and plugins registered to receive notifi- 
cation for a particular event (e.g., a VM has started, 
been destroyed, or changed state). Plugin modules can 
perform a wide range of tasks, such as maintaining 
persistent system-wide state information, performing 
DDNS updates, or doing external environment prepa- 
ration and cleanup. 


The client library provides an API for applications 
to communicate with the Usher controller. Essentially, 
clients submit requests to the controller when they need 
to manipulate their VMs or request additional VMs. 
The controller can grant or deny these requests as its 
operational policy dictates. One purpose of clients is to 
serve as the user interface to the system, and users use 
clients to manage their VMs and monitor system state. 
More generally, arbitrary applications can use the client 
library to register callbacks for events of interest in the 
Usher system. 


Typically, a few services also support a running 
Usher system. Depending upon the functionality de- 
sired and the infrastructure provided by a particular 
site, these services might include a combination of the 
following: a database server for maintaining state in- 
formation or logging, a NAS server to serve VM 
filesystems, an authentication server to provide au- 
thentication for Usher and VMs created by Usher, a 
DHCP server to manage IP addresses, and a DNS 
server for name resolution of all Usher created VMs. 
Note that an administrator may configure Usher to use 
any set of support services desired, not necessarily the 
preceding list. 


Usher Components 


As noted earlier, Usher consists of three main 
components: local node managers on each node, a
central controller, and Usher clients. 


Local Node Managers 

The local node managers (LNMs) operate closest 
to the hardware. As shown in Figure 2, LNMs run as 
servers on each physical node in the Usher system. 
The LNMs have three major duties: i) to provide a re- 
mote API to the controller for managing local VMs, ii) 
to collect and periodically upload local resource usage 
data to the controller, and iii) to report local events to 
the controller. 


Each LNM presents a remote API to the controller for
manipulating VMs on its node. When the controller invokes
an API method, the LNM translates the operation
into the equivalent operation of the VM management
API exposed by the VMM running on the node. Note
that all LNM API methods are asynchronous so that
the controller does not block waiting for the VMM
operation to complete. We emphasize that this architecture
abstracts VMM-specific implementations: the
controller is oblivious to the specific VMMs running 
on the physical nodes as long as the LNM provides the 
remote API implementation. As a result, although our 
implementation currently uses the Xen VMM, Usher 
can target other virtualization platforms. Further, Usher 
is capable of managing VMs running any operating 
system supported by the VMMs under its management. 


As the Usher system runs, VM and VMM re- 
source usage fluctuates considerably. The local node 
manager on each node monitors these fluctuations and 
reports them back to the controller. It reports CPU
utilization, network receive and transmit loads, disk I/O
activity, and memory usage as 1-, 5-, and 15-minute
averages.
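

As an illustration of this kind of reporting, the sketch below keeps timestamped samples and averages them over trailing 1-, 5-, and 15-minute windows. It assumes simple sliding-window means; the text does not say whether Usher uses this method or a damped average, and the class name `UsageWindow` is hypothetical.

```python
from collections import deque


class UsageWindow:
    """Keeps timestamped samples and averages them over trailing windows."""

    WINDOWS = (60, 300, 900)  # 1, 5, and 15 minutes, in seconds

    def __init__(self):
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp: float, value: float) -> None:
        self.samples.append((timestamp, value))
        # Drop samples that fall outside the largest window.
        oldest_allowed = timestamp - max(self.WINDOWS)
        while self.samples and self.samples[0][0] <= oldest_allowed:
            self.samples.popleft()

    def averages(self, now: float):
        """Return one average per window; None if a window has no samples."""
        result = []
        for window in self.WINDOWS:
            values = [v for t, v in self.samples if now - window < t <= now]
            result.append(sum(values) / len(values) if values else None)
        return tuple(result)
```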


In addition to changes in resource usage, VM 
state changes sometimes occur unexpectedly. VMs can 
crash or even unexpectedly appear or disappear from 
the system. Detecting these and other related events 
requires both careful monitoring by the local node 
managers as well as VMM support for internal event 
notification. Administrators can set a tunable parame- 
ter for how often the LNM scans for missing VMs or 
unexpected VMs. The LNM will register callbacks 
with the VMM platform for other events, such as VM 
crashes; if the VMM does not support such callbacks, 
the LNM will periodically scan to detect these events.
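

The periodic scan can be pictured as a set comparison between the VMs an LNM expects to find and the VMs the VMM actually reports. The function below is a minimal sketch under that assumption; `scan` and its arguments are hypothetical names, and a real LNM would obtain the running list from the VMM's administrative interface.

```python
def scan(expected, running):
    """Compare VMs the LNM expects with VMs the VMM reports.

    Returns (missing, unexpected) as sorted lists of VM names.
    """
    expected, running = set(expected), set(running)
    missing = sorted(expected - running)     # cached, but no longer running
    unexpected = sorted(running - expected)  # running, but never cached
    return missing, unexpected
```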


Figure 2: Usher components.





Usher Controller 


The controller is the center of the Usher system.
It can either be bootstrapped into a VM running in the
system, or run on a separate server. The controller provides
the following:

• User authentication
• VM operation request API
• Global state maintenance
• Consolidation of LNM monitoring data
• Event notification


User authentication: Usher uses SSL-encrypted 
user authentication. All users of the Usher system must 
authenticate before making requests of the system.
Administrators are free to use any of the included
authentication modules for use with various authentication
backends (e.g., LDAP), or implement their own. An
administrator can register multiple authentication modules,
and Usher will query each in turn. This support is
useful, for instance, to provide local password authen- 
tication if LDAP or NIS authentication fails. After re- 
ceiving a user’s credentials, the controller checks them 
against the active authentication module chain. If one 
succeeds before reaching the end of the chain, the user 
is authenticated. Otherwise, authentication fails and 
the user must retry. 
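

The chain behavior described above can be sketched as follows. This is an illustration only: the module functions are toy stand-ins, and treating a backend error as a failed module (so that the next module is consulted) is our assumption about how the fallback case would be handled.

```python
def authenticate(chain, username, password):
    """Try each authentication module in order; the first success wins."""
    for module in chain:
        try:
            if module(username, password):
                return True
        except Exception:
            # Assumption: a backend error (e.g., LDAP unreachable) counts
            # as a failure of that module, so the next one is consulted.
            continue
    return False


def ldap_unreachable(username, password):
    """Toy module that simulates an unavailable directory server."""
    raise ConnectionError("directory server not reachable")


def local_passwords(username, password):
    """Toy stand-in for a local password check."""
    return (username, password) == ("alice", "s3cret")
```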


VM operation request API: A key component of 
the controller is the remote API for Usher clients. This 
API is the gateway into the system for VM manage- 
ment requests (via RPC) from connecting clients. Typi- 
cally, the controller invokes an authorization plugin to 
verify that the authenticated user can perform the oper- 
ation before proceeding. The controller may also in- 
voke other plugins to do preprocessing such as check- 
ing resource availability and making placement deci- 
sions at this point. Usher calls any plugin modules reg- 
istered to receive notifications for a particular request 
once the controller receives such a request. 


Usher delegates authorization to plugin modules 
so that administrators are free to implement any policy 
or policies they wish and stack and swap modules as 
the system runs. In addition, an administrator can con- 
figure the monitoring infrastructure to automatically 
swap or add policies as the system runs based upon 
current system load, time of day, etc. In its simplest 
form, an authorization policy module openly allows 
users to create and manipulate their VMs as they de- 
sire or view the global state of the system. More re- 
strictive policies may limit the number of VMs a user 
can start, prohibit or limit migration, or restrict what 
information the system returns upon user query. 


Once a request has successfully traversed the au- 
thorization and preprocessing steps, the controller exe- 
cutes it by invoking asynchronous RPCs to each LNM 
involved. As described above, it is up to any plugin pol- 
icy modules to authorize and check resource availabili- 
ty prior to this point. Depending upon the running poli- 
cy, the authorization and preprocessing steps may alter 
a user request before the controller executes it. For ex- 
ample, the policy may be to simply “do the best I can” 
to honor a request when it arrives. If a user requests 
more VMs than allowed, this policy will simply start as 
many VMs as are allowed for this user, and report back 
to the client what action was taken. Finally, if insuffi- 
cient resources are available to satisfy an authorized 
and preprocessed request, the controller will attempt to 
fulfill the request until resources are exhausted. 
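

A "do the best I can" policy of this kind might trim a request as in the sketch below. The function name and the per-user limit are hypothetical; the point is only that the policy returns both what it will start and what it refused, so the client can be told what action was taken.

```python
def best_effort(requested, running, limit):
    """Trim a start request so the user never exceeds `limit` VMs.

    `requested` is a list of VM names, `running` is the number of VMs
    the user already has. Returns (vms_to_start, vms_refused).
    """
    allowed = max(0, limit - running)
    return requested[:allowed], requested[allowed:]
```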


Global state maintenance: The controller main- 
tains a few lists which constitute the global state of the 
system. These lists link objects encapsulating state in- 
formation for running VMs, running VMMs, and in- 
stantiated virtual clusters (VCs). A virtual cluster in 
Usher can contain an arbitrary set of VMs, and admin- 
istrators are free to define VCs in any way suitable to 
their computing environment.


In addition to the above lists, the controller maintains
three other lists of VMs: lost, missing, and unmanaged
VMs. The subtle distinction between lost
and missing is that lost VMs are a result of an LNM or 
VMM failure (the controller is unable to make this 
distinction), whereas a missing VM is a result of an 
unexpected VM disappearance as reported by the 
LNM where the VM was last seen running. A missing 
VM can be the result of an unexpected VMM error 
(e.g., we have encountered this case upon a VMM er- 
ror on migration). Unmanaged VMs are typically a re- 
sult of an administrator manually starting a VM on a 
VMM being managed by Usher; Usher is aware of the 
VM, but is not itself managing it. The list of unman- 
aged VMs aids resource usage reporting so that Usher 
has a complete picture of all VMs running on its 
nodes. 
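

The distinction between lost and missing VMs can be illustrated with a small bookkeeping sketch. The class and method names are hypothetical, and plain dictionaries stand in for the controller's lists of VM objects.

```python
class GlobalState:
    """Toy model of the controller's running, lost, and missing lists."""

    def __init__(self):
        self.running = {}  # vm name -> node name
        self.lost = {}
        self.missing = {}

    def lnm_failed(self, node):
        """The LNM or VMM on `node` is unreachable: its VMs become lost."""
        for vm in [v for v, n in self.running.items() if n == node]:
            self.lost[vm] = self.running.pop(vm)

    def vm_disappeared(self, vm):
        """A live LNM reported that `vm` vanished: it becomes missing."""
        if vm in self.running:
            self.missing[vm] = self.running.pop(vm)
```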


Having the controller maintain system state re- 
moves the need for it to query all LNMs in the system 
for every VM management operation and state query. 
However, the controller does have to synchronize with 
the rest of the system, and we discuss synchronization
further in the Component Interaction subsection below.


Consolidation of LNM monitoring data: Prop- 
er state maintenance relies upon system monitoring. 
The controller is responsible for consolidating moni- 
toring data sent by the local node managers into a for- 
mat accessible by the rest of the system. Clients use 
this data to describe the state of the system to users, 
and plugins use this data to make policy decisions. For 
example, plugin modules may use this data to restrict 
user resource requests based on the current system 
load, or make VM scheduling decisions to determine 
where VMs should run. 


Event notification: Usher often needs to alert 
clients and plugin modules when various events in the 
system occur. Events typically fall into one of three 
categories: 

• VM operation requests
• VM state changes
• Errors and unexpected events


Clients automatically receive notices of state 
changes of their virtual machines. Clients are free to 
take any action desired upon notification, and can 
safely ignore them. Plugin modules, however, must 
explicitly register with the controller to receive event 
notifications. Plugins can register for any type of event 
in the system. For example, a plugin may wish to re- 
ceive notice of VM operation requests for preprocess- 
ing, or error and VM state change events for reporting 
and cleanup. 

Clients and the Client API 

Applications use the Usher client API to interact 
with the Usher controller. This API provides methods 
for requesting or manipulating VMs and performing 
state queries. We refer to any application importing 
this API as a client.


The client API provides the mechanism for clients
to securely authenticate and connect to the Usher con- 
troller. Once connected, an application may call any of 
the methods provided by the API. All methods are asyn- 
chronous, event-based calls to the controller (see the 
Implementation section below). As mentioned above, 
connected clients also receive notifications from the 
controller for state changes to any of their VMs. Client 
applications can have their own callbacks invoked for 
these notifications. 


Component Interaction 


Having described each of the Usher components 
individually, we now describe how they interact in 
more detail. We first discuss how the controller and 
LNMs interact, and then describe how the controller 
and clients interact. Note that clients never directly 
communicate with LNMs; in effect, the controller 
“proxies” all interactions between clients and LNMs.


Controller and LNM Interaction 


When an LNM starts or receives a controller re- 
covery notice, it connects to the controller specified in 
its configuration file. The controller authenticates all 
connections from LNMs, and encrypts the connection 
for privacy. Upon connection to the controller, the 
LNM passes a capability to the controller for access to 
its VM management API. 


Using the capability returned by the LNM, the 
controller first requests information about the hard- 
ware configuration and a list of currently running vir- 
tual machines on the new node. The controller adds 
this information to its lists of running VMs and 
VMMs in the system. It then uses the capability to as- 
sume management of the VMs running on the LNM’s 
node. 


The controller also returns a capability back to 
the LNM. The LNM uses this capability for both event 
notification and periodic reporting of resource usage 
back to the controller. 


When the controller discovers that a new node 
already has running VMs (e.g., because the node’s 
LNM failed and restarted), it first determines if it 
should assume management of any of these newly dis- 
covered VMs. The controller makes this determination 
based solely upon the name of the VM. If the VM 
name ends with the domain name specified in the con- 
troller’s configuration file, then the controller assumes 
it should manage this VM. Any VMs which it should 
not manage are placed on the unmanaged list dis- 
cussed above. For any VMs which the controller 
should manage, the controller creates a VM object in- 
stance and places this object on its running VMs list. 
These instances are sent to the LNMs where the VMs 
are running and cached there. Whenever an LNM sees 
that a cached VM object is inconsistent with the corre- 
sponding VM running there (e.g., the state of the VM 
changed), it alerts the controller of this event. The 
controller then updates the cached object on the LNM. 
In this way, the update serves as an acknowledgment
and the LNM knows that the controller received notice 
of the event. 
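

The name-based management decision described above amounts to a suffix test. The sketch below shows one way to write it; the function name and the example domain are hypothetical, and requiring a dot before the domain (so that `notusher.example.org` does not match `usher.example.org`) is our assumption rather than documented Usher behavior.

```python
def partition_discovered(vm_names, domain):
    """Split newly discovered VMs into (managed, unmanaged) by name suffix."""
    suffix = "." + domain.lstrip(".")
    managed = [name for name in vm_names if name.endswith(suffix)]
    unmanaged = [name for name in vm_names if not name.endswith(suffix)]
    return managed, unmanaged
```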


Similarly, the controller sends VM object in- 
stances for newly created VMs to an LNM before the 
VM is actually started there. Upon successful return 
from a start command, the controller updates the VM’s
cached object state on the LNM. Subsequently, the 
LNM assumes the responsibility for monitoring and 
reporting any unexpected state changes back to the 
controller. 


Controller and Client Interaction 


Clients to the Usher system communicate with 
the controller. Before a client can make any requests, 
it must authenticate with the controller. If authentica- 
tion succeeds, the controller returns a capability to the 
client for invoking its remote API methods. Clients 
use this API to manipulate VMs. 


Similar to the local node managers, clients re- 
ceive cached object instances corresponding to their 
VMs from the controller upon connection. If desired, 
clients can filter this list of VMs based upon virtual 
cluster grouping to limit network traffic. The purpose 
of the cached objects at the client is twofold. First, 
they provide a convenient mechanism by which clients 
can receive notification of events affecting their VMs, 
since the controller sends updates to each cached VM 
object when the actual VM is modified. Second, the 
cached VM objects provide state information to the 
clients when they request VM operations. With this or- 
ganization, clients do not have to query the controller 
about the global state of the system before actually 
submitting a valid request. For example, a client 
should not request migration of a non-existent VM, or 
try to destroy a VM which it does not own. The client 
library is designed to check for these kinds of condi- 
tions before submitting a request. Note that the con- 
troller is capable of handling errant requests; this 
scheme simply offloads request filtering to the client. 
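

A client-side check of this kind could look like the following sketch. The cache layout (a dictionary of VM names to records with an `owner` field) and the names `check_request` and `RequestError` are hypothetical stand-ins for the cached VM objects and the client library's own checks.

```python
class RequestError(Exception):
    """Raised when a request is rejected before reaching the controller."""


def check_request(cache, user, operation, vm_name):
    """Reject obviously invalid requests using only locally cached state."""
    vm = cache.get(vm_name)
    if vm is None:
        raise RequestError(f"no such VM: {vm_name}")
    if operation == "destroy" and vm["owner"] != user:
        raise RequestError(f"{user} does not own {vm_name}")
    return True
```

The controller still validates every request it receives; a check like this only saves a round trip for requests that cannot succeed.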


The controller is the authority on the global state 
of the system. When the controller performs an action, 
it does so based on what it believes is the current glob- 
al state. The cached state at the client reflects the con- 
troller’s global view. For this reason, even if the con- 
troller is in error, its state is typically used by clients 
for making resource requests. The controller must be 
capable of recovering from errors due to inconsisten- 
cies between its own view of the global state of the 
system and the actual global state. These inconsisten- 
cies are typically transient (e.g., a late event notifica- 
tion from an LNM), in which case the controller may 
log an error and return an error message to the client.


Failures

As the Usher system runs, it is possible for the 
controller or any of the local node managers to become 
unavailable. This situation could be the result of hard- 
ware failure, operating system failure, or the server itself 
failing. Usher has been designed to handle these failures
gracefully. 


In the event of a controller failure, the LNMs 
will start a listening server for a recovery announce- 
ment sent by the controller. When the controller 
restarts, it sends a recovery message to all previously 
known LNMs. When the LNMs receive this an- 
nouncement, they reconnect to the controller. As men- 
tioned in the Controller and LNM Interaction section 
above, when an LNM connects to the controller, it 
passes information about its physical parameters and 
locally running VMs. With this information from all 
connecting LNMs, the controller recreates the global 
state of the system. With this design, Usher only re- 
quires persistent storage of the list of previously 
known LNMs rather than the entire state of the system 
to restore system state upon controller crash or failure. 


Since the controller does not keep persistent in- 
formation about which clients were known to be con- 
nected before a failure, it cannot notify clients when it 
restarts. Instead, clients connected to a controller 
which fails will attempt to reconnect with timeouts 
following an exponential backoff. Once reconnected, 
clients flush their list of cached VMs and receive a 
new list from the controller. 
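

Reconnection with exponential backoff can be sketched as below. The initial delay, growth factor, and cap are illustrative values that we chose for the example; the text does not state the values Usher uses, and `connect` and `sleep` are caller-supplied callables.

```python
def backoff_delays(initial=1.0, factor=2.0, maximum=60.0):
    """Yield reconnect delays that grow exponentially up to `maximum`."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay *= factor


def reconnect(connect, sleep, max_attempts=10):
    """Call `connect` until it succeeds, sleeping between attempts.

    Returns the number of attempts used, or None if all attempts failed.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        if connect():
            return attempt
        sleep(next(delays))
    return None
```

In a real client, `sleep` would be `time.sleep` or a timer in the event loop, and a successful reconnect would be followed by flushing and refetching the cached VM list.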


The controller detects local node manager fail- 
ures upon disconnect or TCP timeout. When this situa- 
tion occurs, the controller changes the state of all VMs 
known to be running on the node with the failed LNM 
to lost. It makes no out-of-band attempts to determine
if lost VMs are still running or if VMMs on which 
LNMs have failed are still running. The controller 
simply logs an error, and relies upon the Usher admin- 
istrator or a recovery plugin to investigate the cause of 
the error. 


Implementation 


In this section we describe the implementation of 
Usher, including the interfaces that each component 
supports and the plugins and applications currently 
implemented for use with the system. 


Component              Lines of Code
LNM (w/ Xen hooks)     907
Controller             1703
Client API             750
Utilities              633
Ush                    1099


Table 1: Code size of individual components.


The main Usher components are written in Python 
[2]. In addition, Usher makes use of the Twisted net- 
work programming framework [3]. Twisted provides 
convenient mechanisms for implementing event based 
servers, asynchronous remote procedure calls, and remote
object synchronization. Table 1 shows source
code line counts for the main Usher components, for
a total of 3993 lines of code. Also included is the line
count for the ush application (over 400 lines of which are
simply online documentation).


Local Node Managers

Local Node Managers export the remote API 
shown in Table 2 to the controller. This API is made 
available to the controller via a capability passed to 
the controller when an LNM connects to it. 


Method Name                   Description
get_details(vm name)          get VM state information
get_status(vm name)           get VM resource usage statistics
receive(vm instance)          receive new cached VM object
start(vm name)                start cached VM
op_on(operation, vm name)     operate on existing VM
migrate(vm name, lnm name)    migrate VM to LNM
get_node_info()               get node physical characteristics
get_node_status()             get node dynamic and resource usage info


Table 2: Local node manager remote API.


This API includes methods to query for VM state 
information and VM resource usage details using the 
get_details and get_status methods, respectively. State 
information includes run state, memory allocation, IP 
and MAC addresses, the node on which the VM is running,
VM owner, etc. Resource usage includes 1-, 5-,
and 15-minute utilizations of the various hardware
resources.


The receive method creates a cached copy of a 
VM object on an LNM. An LNM receives the cached 
copy when it connects to the controller. It compares 
the state of the VM object with the actual state of the 
virtual machine. If the states differ the LNM notifies 
the controller, which updates the LNM’s cached copy 
of the VM as an acknowledgment that it received the 
state change notice. 


In addition, the cached copy of a VM at its LNM 
contains methods for manipulating the VM it repre- 
sents. When a VM manipulation method exposed by 
the LNM’s API is invoked (one of start, op_on, or mi- 
grate), the method calls the corresponding method of 
the cached VM object to perform the operation. This 
structure provides a convenient way to organize VM 
operations. To manipulate a VM, a developer simply 
calls the appropriate method of the cached VM object. 
Note that the controller must still update the state of 
its VM object as an acknowledgment that the con- 
troller knows the operation was successful. 
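

The dispatch from an LNM API method to the cached VM object can be sketched as follows. `CachedVM` is a hypothetical stand-in that only tracks a state string and implements three of the operations; in a real LNM each method would call the VMM's administrative interface.

```python
class CachedVM:
    """Stand-in for a cached VM object; real methods would call the VMM."""

    def __init__(self, name):
        self.name = name
        self.state = "running"

    def pause(self):
        self.state = "paused"

    def resume(self):
        self.state = "running"

    def destroy(self):
        self.state = "destroyed"


# Operation names accepted by op_on, as listed in the paper's Table 3.
VALID_OPERATIONS = {"pause", "resume", "shutdown", "reboot",
                    "hibernate", "restore", "destroy", "cycle"}


def op_on(cache, operation, vm_name):
    """Dispatch `operation` to the cached object for `vm_name`."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    method = getattr(cache[vm_name], operation, None)
    if method is None:
        raise NotImplementedError(f"{operation} is not part of this sketch")
    method()
    return cache[vm_name].state
```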


Usher: An Extensible Framework for Managing Clusters... 


Most operations on an existing VM are encapsu- 
lated in the op_on function, and have similar signa- 
tures. Table 3 shows the list of valid operations to the 
op_on method. 






Operation    Description
pause        pause VM execution, keeping memory image resident
resume       resume execution of a paused VM
shutdown     nicely halt a VM
reboot       shutdown and restart VM
hibernate    save VM’s memory image to persistent storage
restore      restore hibernated VM to run state
destroy      hard shutdown a VM
cycle        destroy and restart a VM


Table 3: Operations supported by the op_on method.


All VM operations invoke a corresponding oper- 
ation in the VMM’s administration API. Though Ush- 
er currently only manages Xen VMs, it is designed to 
be VMM-agnostic. An installation must provide an 
implementation of Usher’s VMM interface to support 
new virtual machine managers. 


The LNM’s remote API exposes a few methods 
that do not operate on VMs. The get_node_info method 
returns hardware characteristics of the physical ma- 
chine. The controller calls this method when an LNM 
connects. The get_node_status method is similar to the 
get_status method. Additionally, it reports the number 
of VMs running on the VMM and the amount of free 
memory on the node. 


Usher Controller 


The remote API exported by the controller to 
connecting clients closely resembles the interface ex- 
ported by LNMs to the controller. Table 4 lists the 
methods exported by the controller to Usher clients. 
This API is made available to clients via a capability 
passed upon successful authentication with the con- 
troller. 


Note that most of these methods operate on lists 
of VMs, rather than the single VMs expected by the LNM
API methods. Since Usher was designed to manage 
clusters, the common case is to invoke these methods 
on lists of VMs rather than on a single VM at a time. 
This convention saves significant call overhead when 
dealing with large lists of VMs. 


The start and migrate methods both take a list of 
LNMs. For start, the list specifies the LNMs on which 
the VMs should be started. An empty list indicates 
that the VMs can be started anywhere. Recall that this 
parameter is simply a suggestion to the controller. 
Policies installed in the controller dictate whether or 
not the controller will honor the suggestion. Likewise, 
the LNM list passed to the migrate method is simply a 
suggestion to the controller as to where to migrate the
VMs. The controller can choose to ignore this sugges- 
tion or ignore the migrate request altogether based up- 
on the policies installed. 


The operations supported by the op_on method in 
the controller API are the same as those to the op_on 
method of the remote LNM API (Table 3). 


Method Name                    Description
list(vm list, status)          list state and resource usage information for VMs
list_lnms(lnm list, status)    list LNMs and resource usage information for VMMs
start(vm list, lnm list)       start list of VMs on LNMs
op_on(operation, vm list)      operate on existing VMs
migrate(vm list, lnm list)     migrate VMs to LNMs


Table 4: Controller remote API for use by Usher
clients.


Client API 


The client API closely mirrors that of the con- 
troller. An important difference between these two 
APIs, though, is that the client API signatures contain 
many additional parameters to aid in working with 
large sets of VMs. These additional parameters allow 
users to operate on arbitrary sets of VMs and virtual 
clusters in a single method call. The API supports spec- 
ifying VM sets as regular expressions, explicit lists, or 
ranges (when VM names contain numbers). The client 
API also allows users to specify source and destination 
LNMs using regular expressions or explicit lists. 
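

To illustrate the three ways of naming a VM set, the sketch below resolves a specification against a list of known VM names. The bracketed range syntax (`node[1-3]`) is a hypothetical notation chosen for this example, as is the function name; the text does not give Usher's actual syntax. Because a bracketed range is also a legal regular expression, this sketch checks for the range form first.

```python
import re


def expand_vm_spec(spec, known_vms):
    """Resolve a VM set specification against the names in `known_vms`.

    `spec` may be an explicit list of names, a range string such as
    "node[1-3]" (hypothetical syntax), or a regular expression string.
    """
    if isinstance(spec, (list, tuple)):
        return [name for name in spec if name in known_vms]
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match:
        prefix, low, high, suffix = match.groups()
        wanted = {f"{prefix}{i}{suffix}"
                  for i in range(int(low), int(high) + 1)}
        return [name for name in known_vms if name in wanted]
    pattern = re.compile(spec)
    return [name for name in known_vms if pattern.fullmatch(name)]
```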


Another difference between the client and con- 
troller APIs is that the client API expands the op_on 
method into methods for each type of operation. Ex- 
plicitly enumerating the operations as individual meth- 
ods avoids confusing application writers unfamiliar 
with the op_on method. These methods simply wrap 
the call to the op_on method, which is still available 
for those wishing to call it directly. 
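

One way to generate such wrappers is shown below. The `Client` class here is a hypothetical stub whose `op_on` merely records calls; in the real client library the call would be forwarded to the controller.

```python
# Operation names as listed in the paper's Table 3.
OPERATIONS = ("pause", "resume", "shutdown", "reboot",
              "hibernate", "restore", "destroy", "cycle")


class Client:
    """Stub client: op_on records calls instead of contacting a controller."""

    def __init__(self):
        self.calls = []

    def op_on(self, operation, vm_list):
        self.calls.append((operation, list(vm_list)))


def _make_wrapper(operation):
    """Build a method that forwards to op_on with a fixed operation name."""
    def wrapper(self, vm_list):
        return self.op_on(operation, vm_list)
    wrapper.__name__ = operation
    return wrapper


# Attach one convenience method per operation, e.g. Client.pause(vm_list).
for _operation in OPERATIONS:
    setattr(Client, _operation, _make_wrapper(_operation))
```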


Finally, the client API contains connect and recon- 
nect methods. These methods contact and authenticate 
with the controller via SSL. They also start the client’s 
event loop to handle cached object updates and results 
from asynchronous remote method calls. The reconnect 
method is merely a convenience method to avoid hav- 
ing to pass credentials to the API if a reconnect is re- 
quired after having been successfully connected. This 
method can be used by a reconnecting application up- 
on an unexpected disconnect. 


Configuration Files 


All Usher components are capable of reading
configuration data from text files. All valid configuration
parameters, their type, and default values are
specified in the code for each component. When each
component starts, it first parses its configuration files 
(if found). The configuration system tries to read in 
values from the following locations (in order): a de- 
fault location in the host filesystem, a default location 
in the user’s home directory, and finally a file indicat- 
ed by an environment variable. This search ordering 
enables users to override default values easily. Values 
read in later configuration files replace values speci- 
fied in a previously read file. 
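
The layered lookup can be sketched in Python as follows. The INI file format, the section name usher, the file paths, and the environment variable name USHER_CONFIG are all assumptions made for this example; the text above specifies only the search order and the override rule.

```python
import configparser
import os
import tempfile

def search_paths(env=os.environ):
    # All three locations are hypothetical stand-ins.
    return [
        "/etc/usher/usher.conf",                   # host filesystem default
        os.path.expanduser("~/.usher/usher.conf"), # user's home directory
        env.get("USHER_CONFIG"),                   # file named by environment
    ]

def load_config(defaults, paths):
    """Layer configuration files over defaults given in code.

    Files are read in order and missing files are skipped, so a value in a
    later file replaces the same value from an earlier one.
    """
    parser = configparser.ConfigParser()
    parser.read_dict({"usher": defaults})
    for path in paths:
        if path and os.path.exists(path):
            parser.read(path)
    return dict(parser["usher"])

# Demonstration with temporary files standing in for the real locations.
with tempfile.TemporaryDirectory() as tmp:
    system_file = os.path.join(tmp, "system.conf")
    user_file = os.path.join(tmp, "user.conf")
    with open(system_file, "w") as f:
        f.write("[usher]\nport = 9000\nlog_level = info\n")
    with open(user_file, "w") as f:
        f.write("[usher]\nlog_level = debug\n")
    merged = load_config({"port": "8000", "host": "localhost"},
                         [system_file, user_file, None])
print(merged)
```

Here the system file overrides the coded default for port, and the user file overrides the system file's log_level, while host keeps its coded default.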
Plugins 

Plugins are separate add-on modules which can 
be registered to receive notification of nearly any 
event in the system. Plugins live in a special directory 
(aptly named “plugins”) of the Usher source tree.
Usher also looks in a configurable location for third- 
party/user plugins. Any plugins found are automatical- 
ly sourced and added to a list of available plugins. To 
register a plugin, the controller provides an additional 
API call register_plugin(plugin name, event, configuration 
file). Each plugin is required to provide a method 
named entry_point to be called when an event fires for 
which it is registered. It is possible to add a single 
plugin to multiple event handler chains. Note that the 
register_plugin method can be called from anywhere in 
the Usher code. 


By default, plugins for each event are simply 
called in the order in which they are registered. There- 
fore careful consideration must be given to ordering 
while registering plugins. A plugin’s configuration ob- 
ject can optionally take an order parameter that gov- 
erns the order in which plugins are called on the 
event’s callback list. The plugin API also provides a 
converse unregister_plugin call to change event han- 
dling at runtime. 
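
A minimal Python sketch of such a registry is shown below. It is not the Usher implementation: the class PluginRegistry and its fire method are hypothetical, register_plugin here takes a plugin object and an order value directly rather than a plugin name and configuration file, and the chain is invoked synchronously for simplicity.

```python
class PluginRegistry:
    """Hypothetical registry mapping event names to ordered plugin chains."""

    def __init__(self):
        self._chains = {}    # event -> list of (order, sequence, plugin)
        self._seq = 0        # registration counter, used to break ties

    def register_plugin(self, plugin, event, order=0):
        self._seq += 1
        chain = self._chains.setdefault(event, [])
        chain.append((order, self._seq, plugin))
        # Lower order runs first; equal orders keep registration order.
        chain.sort(key=lambda item: item[:2])

    def unregister_plugin(self, plugin, event):
        self._chains[event] = [item for item in self._chains.get(event, [])
                               if item[2] is not plugin]

    def fire(self, event, payload):
        for _, _, plugin in self._chains.get(event, []):
            payload = plugin.entry_point(event, payload)
        return payload

class Tagger:
    """Toy plugin that appends its tag to the payload."""
    def __init__(self, tag):
        self.tag = tag
    def entry_point(self, event, payload):
        return payload + [self.tag]

registry = PluginRegistry()
auth, place, log = Tagger("auth"), Tagger("placement"), Tagger("log")
registry.register_plugin(place, "start_request", order=10)
registry.register_plugin(auth, "start_request", order=1)
registry.register_plugin(log, "start_request", order=10)
before = registry.fire("start_request", [])
registry.unregister_plugin(log, "start_request")
after = registry.fire("start_request", [])
print(before, after)
```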


Plugins can be as simple or complex as neces- 
sary. Since the controller invokes plugin callback 
chains asynchronously, complex plugins should not in- 
terfere with the responsiveness of the Usher system 
(i.e., the main controller event loop will not block 
waiting for a plugin to finish its task). 


Policies in an Usher installation are implemented 
as plugins. As an example, an administrator may have 
strict policies regarding startup and migration of virtu- 
al machines. To enforce these policies, a plugin (or 
plugins) is written to authorize start and migrate re- 
quests. This plugin gets registered for the start_request 
and migrate_request events, either manually using the 
controller’s register_plugin command, or by specifying
these registrations in the controller’s configuration 
file. Once registered, subsequent start and migrate re- 
quests are passed to the plugin (in the form of a Re- 
quest object) for authorization. At this point, the 
plugin can approve, approve with modification, or 
simply reject the request. Once this is done, the re- 
quest is passed on to any other plugins registered on 
the start_request or migrate_request event lists with a 
higher order attribute. 
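
The three possible outcomes can be illustrated with a small Python policy plugin. The Request fields and the per-user limit policy are assumptions made for this sketch; only the entry_point method name and the approve, modify, or reject behavior come from the text above.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    """Hypothetical stand-in for the Request object passed to plugins."""
    user: str
    vm_list: list
    approved: bool = True
    notes: list = field(default_factory=list)

class StartLimitPolicy:
    """Authorize start requests against a per-user VM limit (assumed policy)."""

    def __init__(self, limits, default_limit=0):
        self.limits = limits
        self.default_limit = default_limit

    def entry_point(self, event, request):
        limit = self.limits.get(request.user, self.default_limit)
        if limit == 0:
            # Reject the request outright.
            request.approved = False
            request.vm_list = []
            request.notes.append("user may not start VMs")
        elif len(request.vm_list) > limit:
            # Approve with modification: trim the list to the limit.
            request.vm_list = request.vm_list[:limit]
            request.notes.append(f"trimmed to {limit} VMs")
        # Otherwise approve unchanged.
        return request

policy = StartLimitPolicy({"alice": 2})
trimmed = policy.entry_point("start_request", Request("alice", ["a", "b", "c"]))
rejected = policy.entry_point("start_request", Request("bob", ["x"]))
print(trimmed, rejected)
```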
Besides authorization policies, one can imagine
policies for VM operation and placement. Examples
include initial VM placement, VM scheduling (i.e.,
dynamic migration based on load or optimizing a utility
function), and reservations. A policy plugin for initial
placement would be registered for the start_request 
event (probably with a higher order attribute than the 
startup authorization policy discussed above so that it 
is called later in the plugin callback chain). Some sim- 
ple policies such a plugin might support are round- 
robin and least-loaded. Scheduling and reservation 
plugins could be registered with a timer to be fired pe- 
riodically to evaluate the state of the system and make 
decisions about where VMs should be migrated and 
which VMs might have an expired reservation, respec- 
tively. 
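
The two simple placement policies named above could be sketched in Python as follows. The function names and the use of VM count as the load metric are assumptions made for this example.

```python
import itertools

def round_robin(vm_list, lnm_list):
    """Assign VMs to LNMs in rotation."""
    hosts = itertools.cycle(lnm_list)
    return {vm: next(hosts) for vm in vm_list}

def least_loaded(vm_list, lnm_load):
    """Assign each VM to the LNM currently hosting the fewest VMs.

    lnm_load maps LNM name -> number of VMs it hosts. Ties are broken
    by name. The input mapping is not modified.
    """
    load = dict(lnm_load)
    placement = {}
    for vm in vm_list:
        target = min(load, key=lambda lnm: (load[lnm], lnm))
        placement[vm] = target
        load[target] += 1      # account for the VM just placed
    return placement

rr = round_robin(["vm1", "vm2", "vm3"], ["lnmA", "lnmB"])
ll = least_loaded(["vm1", "vm2", "vm3"], {"lnmA": 2, "lnmB": 0})
print(rr)
print(ll)
```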

As a concrete example of plugin usage in Usher, 
we now discuss plugins implemented for use by the 
UCSD Usher installation, and outline the sequence of 
events for a scenario of starting a set of VMs. Detailed 
discussion about these plugins is deferred to the 
UCSD SysNet subsection of the Usher Deployments 
section. 


The UCSD installation uses the following plug- 
ins: an SQL database plugin for logging, mirroring 
global system state, and IP address management; an 
LDAP plugin for user authentication for both Usher 
and VMs created by Usher; a filesystem plugin for 
preparing writable VM filesystems; a DNS plugin for 
modifying DNS entries for VMs managed by Usher; 
and a default placement plugin to determine where 
VMs should be started. We are developing additional 
modules for VM scheduling as part of ongoing re- 
search. 

All plugins for the UCSD installation are written 
in Python. Table 5 contains line counts for these plug- 
ins. Overall, the UCSD plugins total 1406 lines of 
code. 


Plugin        Lines of code
Database      260
LDAP          870
Filesystem    54
DNS           90
Placement     132

Table 5: Code size of UCSD plugins.





When a request to start a list of VMs arrives, the 
controller calls the modules registered for the “start
request” event. The placement and database modules 
are in the callback list for this event. The placement 
module first determines which VMs to start based on 
available resources and user limits, then determines 
where each of the allowed VMs will start. The data- 
base module receives the modified request list, logs 
the request, then reserves IP addresses for each of the 
new VMs. 
The controller generates a separate VM start
command for each VM in the start list. Prior to invoking
the start command, the controller triggers a “register
VM” event. The database, DNS, and filesystem
plugin modules are registered for this event. The database
module adds the VM to a “VMs” table to mirror
the fact that this is now a VM included in the controller’s
view of the global state. The DNS plugin simply
sends a DDNS update to add an entry for this VM
in our DNS server. The filesystem module prepares
mount points on an NFS server for use by each VM.


Finally, upon return from each start command, a 
“start complete” event fires. The database module 
registers to receive this event. The database module 
checks the result of the command, logs this to the 
database, then either marks the corresponding IP ad- 
dress as used (upon success) or available (upon fail- 
ure). Note that the database module does not change 
the state of the VM in the VMs table until receiving a 
“state changed” event for this VM (which originates 
at the LNM). 
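
The IP address handling across the “start request” and “start complete” events amounts to a small state machine, sketched below in Python. The class IPPool and its in-memory dictionary are assumptions made for illustration; the UCSD plugin keeps this state in an SQL database, and the address range used here is a documentation range.

```python
import ipaddress

class IPPool:
    """Hypothetical pool tracking available, reserved, and used addresses."""

    def __init__(self, network):
        self.state = {str(ip): "available"
                      for ip in ipaddress.ip_network(network).hosts()}

    def reserve(self, count):
        """On a start request: reserve addresses for the new VMs."""
        free = [ip for ip, s in self.state.items() if s == "available"]
        if len(free) < count:
            raise RuntimeError("not enough free addresses")
        for ip in free[:count]:
            self.state[ip] = "reserved"
        return free[:count]

    def start_complete(self, ip, success):
        """On start complete: keep the address on success, release it on failure."""
        if self.state.get(ip) != "reserved":
            raise ValueError(f"{ip} is not reserved")
        self.state[ip] = "used" if success else "available"

pool = IPPool("192.0.2.0/29")
first, second = pool.reserve(2)
pool.start_complete(first, success=True)
pool.start_complete(second, success=False)
print(pool.state[first], pool.state[second])
```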


Sites can install or customize plugins as neces- 
sary. The Usher system supports taking arbitrary input 
and passing it to plugins on request events. For exam- 
ple, a request to start a VM may include information 
about which OS to boot, which filesystem to mount, 
etc. Plugin authors can use this mechanism to com- 
pletely customize VMs at startup. 


Applications 


We have written two applications using the client 
API, a shell named Ush and an XML-RPC server 
named plusher, and are developing two other applica- 
tions, a Web interface and a command-line suite. The 
Web interface will provide a convenient point and 
click interface accessible from any computer with a 
web browser. The command line suite will facilitate 
writing shell scripts to manage virtual machines. 


Ush Client 

The Usher shell Ush provides an interactive 
command-line interface to the API exported by the 
Usher controller. Ush provides persistent command 
line history and comes with extensive online help for 
each command. If provided, Ush can read connection 
details and other startup parameters from a configura- 
tion file. Ush is currently the most mature and pre- 
ferred interface for interacting with the Usher system. 


As an example of using the Usher system, we de- 
scribe a sample Ush session from the UCSD Usher in- 
stallation, along with a step-by-step description of ac- 
tions taken by the core components to perform each 
request. In this example, user “mmcnett” requests ten 
VMs. Figure 3 contains a snapshot of Ush upon com- 
pletion of the start command. 


First, the user connects to the Usher controller by
running the “connect” command. In connecting, the
controller receives the user’s credentials and checks
them against the LDAP database. Once authentication
succeeds, the controller returns a capability for its remote
API and all of user mmcnett’s VMs. The somewhat
unusual output “<Command 0 result pending...>”
reflects the fact that all client calls to the controller are
asynchronous. When “connect” returns, Ush responds
with the “Command 0 result:” message followed by the
actual result “Connected”.
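
The pending-then-result pattern can be mimicked with a short Python sketch. The class AsyncShellSketch is hypothetical and uses a simple in-process queue in place of real remote calls; it shows only why the “pending” line is printed before the result arrives.

```python
from collections import deque

class AsyncShellSketch:
    """Hypothetical shell front end: calls return at once, results arrive later."""

    def __init__(self):
        self._pending = deque()    # calls awaiting a reply
        self._next_id = 0
        self.lines = []            # what the shell would print

    def call(self, remote_func, *args):
        cmd_id = self._next_id
        self._next_id += 1
        self._pending.append((cmd_id, remote_func, args))
        self.lines.append(f"<Command {cmd_id} result pending...>")
        return cmd_id              # returns before the result exists

    def run_event_loop(self):
        # Stand-in for the client event loop delivering replies.
        while self._pending:
            cmd_id, remote_func, args = self._pending.popleft()
            self.lines.append(f"Command {cmd_id} result:")
            self.lines.append(str(remote_func(*args)))

shell = AsyncShellSketch()
shell.call(lambda: "Connected")
pending_only = list(shell.lines)   # only the "pending" line so far
shell.run_event_loop()
print("\n".join(shell.lines))
```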


Upon connecting, Ush saves the capability and
cached VM instances sent by the controller. Once connected,
the user runs the “list” command to view his
currently running VMs. Since the client already has 
cached instances of user mmcnett’s VMs, the list com- 
mand does not invoke any remote procedures. Conse- 
quently, Ush responds immediately indicating that us- 
er mmcnett already has two VMs running. 


The user then requests the start of ten VMs in the
“sneetch” cluster. In this case, the -n argument specifies
the name of a cluster, and the -c argument specifies
how many VMs to start in this cluster. When the
controller receives this request, it first calls on the au- 
thorization and database modules to authorize the re- 
quest and reserve IP addresses for the VMs to be start- 
ed. Next, the controller calls the initial placement 
plugin to map where the authorized VMs should be 
started. The controller calls the start method of the re- 
mote LNM API at each new VM’s LNM. The LNMs 
call the corresponding method of the VMM admini- 
stration API to start each VM. Upon successful return 
of all of these remote method calls, the controller re- 
sponds to the client that the ten VMs were started in 
two seconds and provides information about where 
each VM was started. After the VMs complete their boot
sequence, user mmcnett can ssh into any of his new
VMs by name.


[Figure 3: screenshot of an Ush session. User mmcnett runs “connect”
(reported as “Command 0 result: Connected”), then “list” (showing his two
running VMs, horton and usher, under mmcnett.usher.ucsdsys.net), and then
“start -n sneetch -c 10”. The controller replies “Controller started 10 VMs
in 2 seconds:” followed by one line per VM, 1.sneetch.mmcnett.usher.ucsdsys.net
through 10.sneetch.mmcnett.usher.ucsdsys.net, each naming the host on which
the VM was started.]

Figure 3: Ush


Plusher 

Plush [7] is an extensible execution management 
system for large-scale distributed systems, and plusher 
is an XML-RPC server that integrates Plush with Usher.
Plush users describe batch experiments or computations
in a domain-specific language. Plush uses this
input to map resource requirements to physical resources,
bind a set of matching physical resources to
the experiment, set up the execution environment, and
finally execute, monitor and control the experiment.


Since Usher is essentially a service provider for 
the virtual machine “resource”, it was natural to integrate
it with Plush. This integration enables users to
request virtual machines (instead of physical ma- 
chines) for running their experiments using a familiar 
interface. 


Developing plusher was straightforward. Plush al- 
ready exports a simple control interface through XML- 
RPC to integrate with resource providers. Plush re- 
quires providers to implement a small number of up- 
calls and down-calls. Up-calls allow resource providers 
to notify Plush of asynchronous events. For example, 
using down-calls Plush requests resources asynchro- 
nously so that it does not have to wait for resource allo- 
cation to complete before continuing. When the pro- 
vider finishes allocating resources, it notifies Plush us- 
ing an up-call. 
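
The division of labor between down-calls and up-calls can be sketched in Python as below. The class ProviderSketch and its method names are hypothetical, and the XML-RPC transport is omitted entirely; the sketch shows only that the request returns immediately and completion is reported later through a callback.

```python
class ProviderSketch:
    """Hypothetical resource provider illustrating down-calls and up-calls."""

    def __init__(self, upcall):
        self._upcall = upcall      # callable supplied by the requester
        self._pending = {}         # request id -> number of resources
        self._next_id = 0

    def request_resources(self, count):
        """Down-call: record the request and return immediately."""
        self._next_id += 1
        self._pending[self._next_id] = count
        return self._next_id

    def finish_allocation(self, request_id, addresses):
        """Invoked when allocation completes; notifies the requester (up-call)."""
        del self._pending[request_id]
        self._upcall(request_id, addresses)

events = []
provider = ProviderSketch(lambda rid, addrs: events.append((rid, addrs)))
rid = provider.request_resources(2)
events_before = list(events)       # nothing has been reported yet
provider.finish_allocation(rid, ["192.0.2.10", "192.0.2.11"])
print(events)
```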

To integrate Plush and Usher in plusher, we only 
needed to implement stubs for this XML-RPC inter- 
face in Usher. The XML-RPC stub uses the Client API 
to talk to the Usher controller. The XML-RPC stub 
acts as a proxy for authentication — it relays the au- 
thentication information (provided by users to Plush) 
to the controller before proceeding. When the request- 
ed virtual machines have been created, plusher returns 
a list of IP addresses to Plush. If the request fails, it re- 
turns an appropriate error message. 


Usher Deployments 


Next we describe two deployments of Usher that 
are in production use at different sites. The first de- 
ployment is for the UCSD CSE Systems and Network- 
ing research group, and the second deployment is at 
the Russian Research Center, Kurchatov Institute 
(RRC-KI). The two sites have very different usage 
models and computing environments. In describing 
these deployments, our goal is to illustrate the flexibility
of Usher to meet different virtual machine management
requirements and to concretely demonstrate how
sites can extend Usher to achieve complex management
goals. Usher does not require a site to set up or manage
its infrastructure in the way either of these two
installations does.


UCSD SysNet 


The UCSD CSE Systems and Networking (Sys- 
Net) research group has been using Usher experimen- 
tally since June 2006 and for production since January 
2007. The group consists of nine faculty, 50 graduate 
students, and a handful of research staff and undergrad- 
uate student researchers. The group has a strong focus 
on experimental networking and distributed systems re- 
search, and most projects require large numbers of ma- 
chines in their research. As a result, the demand for 
machines far exceeds the supply of physical machines,
and juggling physical machine allocations never satis- 
fies all parties. However, for most of their lifetimes, vir- 
tual machines can satisfy the needs of nearly all 
projects: resource utilization is bursty with very low 
averages (1% or less), an ideal situation for multiplex- 
ing; virtualization overhead is an acceptable trade-off 
to the benefits Usher provides; and users have com- 
plete control over their clusters of virtual machines, 
and can fully customize their machine environments. 
Usher can also isolate machines, or even remove them 
from virtualization use, for particular circumstances 
(e.g., obtaining final experimental results for a paper 
deadline) and simply place them back under Usher 
management when the deadline passes. 


At the time of this writing, we have staged 25 
physical machines from our hardware cluster into 
Usher. On those machines, Usher has multiplexed up 
to 142 virtual machines in 26 virtual clusters, with an 
average of 63 VMs active at any given time. Our Ush- 
er controller runs on a Dell PowerEdge 1750 with a 
2.8 GHz processor and 2 GB of physical memory. 
This system easily handles our workload. Although 
load is mostly dictated by plugin complexity, using the 
plugins discussed below, the Usher controller con- 
sumes less than 1 percent CPU on average (managing 
75 virtual machines) with a memory footprint of ap- 
proximately 20 MB. The Usher implementation is suf- 
ficiently reliable that we are now migrating the re- 
mainder of our user base from dedicated physical ma- 
chines to virtual clusters, and Usher will soon manage 
all 130 physical nodes in our cluster. 

Usher Usage 

The ability to easily create and destroy arbitrary
numbers of virtual machines has proved to be very
useful, and the SysNet group has used this capability
in a variety of ways.
As expected, this ability has greatly eased demand for 
physical machines within the research group. Projects 
simply create VMs as necessary. Usher has also been 
used to create clusters of virtual machines for students 
in a networking course; each student can create a clus- 
ter on demand to experiment with a distributed proto- 
col implementation. The group also previously re- 
served a set of physical machines for general login ac- 
cess (as opposed to reserved use by a specific research 
project). With Usher, a virtual cluster of convenience 
VMs now serves this purpose, and an alias with 
round-robin DNS provides a logical machine name for 
reference while distributing users among the VMs up- 
on login. Even mundane tasks, such as experimenting 
with software installations or configurations, can ben- 
efit as well because the cost of creating a new machine 
is negligible. Rather than having to undo mistakes, a 
user can simply destroy a VM with an aborted config- 
uration and start from scratch with a new one. 


The SysNet group currently uses a simple policy 
module in Usher to determine the scheduling and
placement of VMs. This module relies upon monitoring
data collected by the controller to make its decisions.
It uses heuristics to place new VMs on lightly
loaded physical machines, and to migrate VMs when a 
particular VM imposes sustained high load on a physi- 
cal machine. Users are reasonably self-policing; they 
could always create large numbers of VMs to fully 
consume system resources, for example, but in prac- 
tice do not. Eventually, as the utilization of physical 
machines increases to the point where VMs substan- 
tially interfere with each other, the group will interpret 
it as a signal that it is time to purchase additional hard- 
ware for the cluster. 


This policy works well for the group, but of 
course is not necessarily suitable for all situations, 
such as the RRC-KI deployment described in the
RRC-KI section below.

Support Services 

Usher at UCSD uses plugins to automatically assign
IP addresses and VLANs to VMs, create convenient
domain name groupings for VMs in a virtual
cluster, install default user accounts, and provide
structured VM-local, VC-global, and system-global
file system access. These plugins interact with four
support servers running as part of the site infrastruc- 
ture. 


SQL Server: The global state of the SysNet in- 
stallation is kept in an SQL backing database. The 
database plugin mentioned in the Plugins subsection 
of the Implementation section provides access to the 
SQL database. Though most of the stored data is log- 
ging data stored for offline analysis of system perfor- 
mance and behavior, the SQL database does provide 
one required service: IP address management. The 
SysNet installation does not use DHCP to manage IP 
address ranges. The SysNet group manages several 
subnets, spanning multiple VLANs. Assigning owner- 
ship of arbitrary IP address ranges of these subnets to 
specified Usher users would be impossible using 
DHCP. As a result, an Usher plugin handles IP address 
management across these subnets. 


LDAP Server: The SysNet LDAP plugin serves 
two purposes. First, it provides methods for managing 
and authenticating Usher users. Second, it provides the 
convenience of creating a branch in the LDAP data- 
base for each cluster an Usher user creates. This 
branch enables each VM the user creates to authenti- 
cate its users through the LDAP database. 


This functionality provides a convenient authen- 
tication service to virtual cluster creators. First, it al- 
lows Usher users to use their Usher credentials as their 
VM login credentials since they are automatically 
added as a user in each cluster created. Since each 
cluster uses a different branch in the LDAP database, 
we use aliasing in LDAP to provide Usher users a sin- 
gle set of credentials. In addition, the plugin adds each 
Usher user to the “admin” group of each cluster the 
user creates. VM filesystems can then be configured to
grant special privileges to this group (e.g., sudo privi- 
leges). This approach is convenient when using a read- 
only NFS root filesystem where no default root pass- 
word is set. 


Second, and more importantly, this arrangement 
addresses the cluster authentication problem for Usher 
users in the SysNet group. Authentication for clusters 
is challenging enough for experienced administrators. 
Delegating this problem to users is not only time con- 
suming for them, but could lead to insecure VMs. 


Creating a separate branch for each cluster allows 
Usher users to create accounts and groups for their 
clusters without burdening the Usher administrator with 
this task. This capability is especially conducive to col- 
laborative work, a common case in a research lab set- 
ting. An administrator could easily be overwhelmed 
with management requests in a setting where users are 
free to create their own clusters, yet are unable to fully 
manage them. This approach pushes many mundane 
administrative tasks out to the users who have the in- 
centive to create accounts on their VMs. 


Allowing Usher users to modify the LDAP data- 
base requires careful configuration of the LDAP serv- 
er, however. An LDAP server configuration file that 
allows Usher users to only manage branches which 
they own is included with the Usher source code. In 
addition, the Usher plugin for the LDAP server in- 
cludes scripts for installation on a user’s VM filesys- 
tems to modify cluster LDAP entries (i.e., to add, 
modify, or delete users and groups). 


DNS Server: By default, Usher names VMs us- 
ing the following naming scheme: 
<requested VM name>.<creator’s username>. 
<Usher system domain name> 
where the Usher system domain name is specified in a 
configuration file read by the controller at startup. The 
DNS plugin adds this name for both forward and re- 
verse name resolution for each VM. 
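
As a small illustration of the naming scheme, the following Python sketch builds the VM name and the corresponding forward and reverse record pairs. The function names, the example domain usher.example.org, and the address are assumptions made for this example; the DDNS update itself is not shown.

```python
import ipaddress

def vm_fqdn(vm_name, username, system_domain):
    """Build a VM name following the scheme above."""
    return f"{vm_name}.{username}.{system_domain}"

def dns_records(vm_name, username, system_domain, ip):
    """Return (forward, reverse) record pairs for one VM."""
    fqdn = vm_fqdn(vm_name, username, system_domain)
    # reverse_pointer gives the name used for reverse (PTR) lookups.
    ptr_name = ipaddress.ip_address(ip).reverse_pointer
    return (fqdn, ip), (ptr_name, fqdn)

forward, reverse = dns_records("web1", "alice", "usher.example.org", "192.0.2.7")
print(forward)
print(reverse)
```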


NAS Server: Live migration of virtual machines 
requires a filesystem accessible by the VM at both the 
source and destination VMM. Since migration is a re- 
quirement of the SysNet installation, SysNet VMs 
must have their root filesystems provided via network- 
attached storage. These filesystems are served over
read-only NFS.


Serving the root filesystem read-only has multi- 
ple benefits. First, it is straightforward to keep filesys- 
tems across all running VMs synchronized and updat- 
ed using read-only NFS root filesystems. Furthermore, 
an experienced administrator can manage this filesys- 
tem to ensure that it is secure (e.g., default firewall 
rules, minimal services started by default, latest secu- 
rity patches, etc.). 

Second, since all VMs mount this filesystem, it
is important that it be as responsive as possible. Ensuring
that the NFS server serving this filesystem is read-only
helps improve performance. Furthermore, an administrator
can configure a read-only NFS server to
cache the entire filesystem in main memory. As a result,
reads go to disk only once.


One issue with using a read-only root filesystem 
is that some files and directories on the filesystem 
must be writable at system startup. We solve this prob- 
lem using a ramdisk for any files and directories 
which must be writable. Early in the boot process, 
these files and directories are copied into the ramdisk, 
then mounted using the --bind flag to make them 
writable. 


Since the SysNet installation serves its root 
filesystems read-only, another NFS server provides 
persistent writable storage. The filesystem plugin ini- 
tializes the filesystems to be mounted prior to starting 
up a new VM. This plugin creates the following direc- 
tories for each VM on the group’s read-write NFS 
server: 

• /net/global: This directory is where users install
or store anything they would like to have
globally accessible by all of their clusters. The
contents of /net/global are the same for all VMs a
user creates.

• /net/cluster: This directory is where users can
store files they want accessible by the current
cluster only. The contents of /net/cluster are the
same for all VMs in the same cluster.

• /net/local: This directory is unique to the current
VM. The contents of /net/local are different
for every VM a user creates. Users can
use this directory to set up services and configuration
files specific to particular VMs.
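
One way a filesystem plugin might lay out these three scopes on the read-write server is sketched below in Python. The function prepare_vm_dirs and the server-side directory layout are assumptions made for illustration; only the three mount points inside the VM come from the text above.

```python
import tempfile
from pathlib import Path

def prepare_vm_dirs(export_root, user, cluster, vm):
    """Create the server-side directories backing a VM's /net mounts.

    Returns a mapping from mount point inside the VM to server directory.
    The server-side layout is hypothetical.
    """
    user_root = Path(export_root) / user
    cluster_root = user_root / "clusters" / cluster
    mounts = {
        "/net/global": user_root / "global",            # shared by all of the user's VMs
        "/net/cluster": cluster_root / "shared",        # shared within one cluster
        "/net/local": cluster_root / "vms" / vm,        # private to one VM
    }
    for path in mounts.values():
        path.mkdir(parents=True, exist_ok=True)
    return mounts

with tempfile.TemporaryDirectory() as tmp:
    vm_a = prepare_vm_dirs(tmp, "alice", "sneetch", "1")
    vm_b = prepare_vm_dirs(tmp, "alice", "sneetch", "2")
    all_created = all(p.is_dir() for p in list(vm_a.values()) + list(vm_b.values()))
print(vm_a["/net/local"].name, vm_b["/net/local"].name)
```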


Finally, all SysNet users are given a home direc- 
tory. Automount takes care of mounting these directo- 
ries upon login. Alternatively, Usher users can choose 
an alternate URI (stored in LDAP) for their home di- 
rectory. 


In each of /net/global, /net/cluster, and /net/local, 
there exists a System V init style directory structure in 
the etc directory. Startup scripts in the directory for the 
appropriate runlevel are run from these three locations 
after the regular system startup scripts run. With this 
configuration, even though users cannot write to the 
root filesystem to change startup scripts, they can have 
services started for their VMs at VM boot. 

RRC-KI 

Usher has also been deployed at the Russian Re- 
search Center, Kurchatov Institute (RRC-KI). The RRC- 
KI deployment demonstrates the flexibility of Usher to 
integrate with different computing environments, and to 
employ different resource utilization policies. Whereas 
the UCSD SysNet Usher deployment targeted a general- 
purpose computing environment, the RRC-KI Usher de- 
ployment targets a batch job execution system that pro- 
vides guaranteed resources to jobs. 


RRC-KI contributes part of its compute infra- 
structure to the Large Hadron Collider (LHC) Grid 
effort [1]. Scientists submit jobs to the system, which
are scheduled via a batch job scheduler. Jobs are as- 
signed to physical machines, and one machine only 
runs a single job at any time. 


Measurements spanning over a year indicated 
that the overall utilization of machines in this system 
is fairly low [13]. While there were some long, com- 
pute intensive jobs, there was a large fraction of short, 
I/O driven jobs. Motivated by these measurements, the 
goal was to build a flexible job execution system that 
would improve the aggregate resource utilization of 
the cluster. 


A straightforward approach is to multiplex sever- 
al jobs on a single machine, and power down the un- 
used machines. However, conventional process-based 
multiplexing on commodity operating systems is in- 
feasible for a variety of reasons, some social and some 
technical: scientists want at least the appearance of ab- 
solute resource guarantees for their jobs; jobs often 
span multiple processes, which makes resource ac- 
counting and allocation challenging; and the number 
of physical machines needed depends on the workload 
and cannot be assigned a priori. 


Virtual machines are a natural solution to this 
problem. Since each job gets its own isolated execution 
environment, resource accounting becomes easier for 
multi-process jobs. VMs also provide much stronger 
isolation guarantees than conventional processes. Each 
job can be given guaranteed resource reservations while 
still maintaining the abstraction of a physical machine. 
A trace-driven simulation showed that a VM-based in- 
frastructure would enable significant savings [13]. 


One of the biggest challenges to this approach is 
management. For a VM-based infrastructure to scale, 
we need an automated system for deploying and manag- 
ing virtual machines, a system that can schedule VMs in 
an intelligent manner, and migrate and place VMs to op- 
timize utilization without sacrificing performance. A 
prototype system is currently being used at RRC-KI 
with Usher as the core management framework. 


Central to this infrastructure is the Policy Dae- 
mon responsible for job scheduling and dynamically 
managing virtual machines (creation, migration, de- 
struction) as a function of the current workload. The 
Policy Daemon uses the Usher Client API to monitor 
VM status and control VM resource utilization from a 
single control point using secure connections to the 
physical hosts. The current testbed comprises a
small number of nodes hosting production Grid jobs in 
the Usher-based environment with plans to expand the 
system to manage a few hundred nodes [15]. 


Adoption Considerations 


Usher was designed to be a flexible, extensible 
framework for managing virtual cluster environments. 
However, our claims are supported only to the extent 
of what we have implemented and tested. At the time 
of this writing, we have used Usher with one VMM 
implementation and the specific instances of plugins
for UCSD and RRC-KI. For other sites to use Usher, if 
the existing plugins do not match their needs as imple- 
mented, then they will have to modify existing plugins 
or write their own. To this end, we do encourage Ush- 
er users to share any modified or new plugins they 
have implemented. 


A final consideration is that of managing clusters 
of physical machines. Though the design of the frame- 
work does not preclude managing clusters of physical 
machines, to date, no plugins for managing physical 
clusters have been written. 


Conclusions 


Usher is an extensible, event-driven management 
system for clusters of virtual machines. The Usher 
core implements basic virtual machine and cluster 
management mechanisms, such as creating, destroy- 
ing, and migrating VMs. Usher clients are applications 
that serve as user interfaces to the system, such as the 
interactive command-line shell Ush, as well as appli- 
cations that use Usher as a foundation for creating and 
manipulating virtual machines for their own purposes. 
Usher supports customizable plugin modules for flexi- 
bly integrating Usher into other administrative ser- 
vices at a site, and for installing policies for the use, 
placement, and scheduling of virtual machines accord- 
ing to the site-specific requirements. Usher has been in 
production use both at UCSD and at the Russian Research
Center, Kurchatov Institute (RRC-KI), and initial feedback
from both users and administrators indicates that
Usher is successfully achieving its goals. 


Usher is free software distributed under the new 
BSD license. Source code, documentation, and tutorials 
are available at http://usher.ucsd.edu. Source code, con- 
figuration files, and initialization scripts for the UCSD 
plugins are also available for download at the site 
above. 


Acknowledgments 


The authors would like to thank Roman Kurakin 
for his insight, patches, and administration of Usher at 
RRC-KI. We also want to thank those people using 
Usher for their research at UCSD. Their feedback has 
been invaluable to the success of Usher in a research 
and academic environment. Finally, we would like to 
thank Alva Couch and our anonymous reviewers for 
their time and insightful comments regarding this pa- 
per. Support for this work was provided in part by 
NSF under CSR-PDOS Grant No. CNS-0615392 and 
the UCSD Center for Networked Systems. 


Author Biographies 


Marvin McNett is a Ph.D. student in the Systems 
and Networking group at the University of California, 
San Diego. His current research focus is virtual machine 
scheduling and management for efficient resource 
utilization. He is the original developer and current 
maintainer of the Usher project. Marvin expects to fin- 
ish his Ph.D. in December, 2007. 


Diwaker Gupta is a Ph.D. student in the Systems 
and Networking group at the University of California, 
San Diego. His current research interests include re- 
source management and performance isolation mecha- 
nisms in virtual machines. 


Amin Vahdat is a Professor in the Department of 
Computer Science and Engineering and the Director 
of the Center for Networked Systems at the University 
of California San Diego. He received his Ph.D. in 
Computer Science from UC Berkeley in 1998. Before 
joining UCSD in January 2004, he was on the faculty 
at Duke University from 1999-2003. 


Geoffrey M. Voelker is an Associate Professor at 
the University of California at San Diego. His research 
interests include operating systems, distributed sys- 
tems, networking, and wireless networks. He received 
a B.S. degree in Electrical Engineering and Computer 
Science from the University of California at Berkeley 
in 1992, and the M.S. and Ph.D. degrees in Computer 
Science and Engineering from the University of Wash- 
ington in 1995 and 2000, respectively. 


Bibliography 


[1] LCG project, http://lcg.web.cern.ch/LCG/ . 

[2] Python, http://www.python.org/ . 

[3] Twisted, http://twistedmatrix.com/ . 

[4] VirtualCenter, http://www.vmware.com/products/ 
vi/vc/ . 

[5] XenEnterprise, http://www.xensource.com/products/ 
xen_enterprise/ . 

[6] Albrecht, Jeannie, Ryan Braud, Darren Dao, Nik- 
olay Topilski, Christopher Tuttle, Alex C. Snoeren, 
and Amin Vahdat, “Remote Control: Distributed 
Application Configuration, Management, and Vi- 
sualization with Plush,” Proceedings of the Twen- 
ty-first USENIX Large Installation System Admini- 
stration Conference (LISA), November, 2007. 

[7] Albrecht, Jeannie, Christopher Tuttle, Alex C. 
Snoeren, and Amin Vahdat, “PlanetLab Application 
Management Using Plush,” ACM Operating 
Systems Review (SIGOPS-OSR), Vol. 40, Num. 1, 
January, 2006. 

[8] Anderson, Thomas E., David E. Culler, David A. 
Patterson, and the NOW Team, “A Case for Networks 
of Workstations: NOW,” IEEE Micro, 
February, 1995. 

[9] Appleby, K., S. Fakhouri, L. Fong, M. K. G. 
Goldszmidt, S. Krishnakumar, D. Pazel, J. Pershing, 
and B. Rochwerger, “Océano - SLA-based 
Management of a Computing Utility,” Proceedings 
of the IFIP/IEEE Symposium on Integrated 
Network Management, May, 2001. 

[10] Begnum, Kyrre, “Managing Large Networks of 
Virtual Machines,” Proceedings of the 20th Large 
Installation System Administration Conference, pp. 
205-214, 2006. 

[11] Bruno, G., M. J. Katz, F. D. Sacerdoti, and P. M. 
Papadopoulos, “Rolls: Modifying a Standard Sys- 
tem Installer to Support User-customizable Cluster 
Frontend Appliances,” IEEE International Conference 
on Cluster Computing, 2004. 

[12] Chase, Jeffrey S., David E. Irwin, Laura E. Grit, 
Justin D. Moore, and Sara E. Sprenkle, “Dynamic 
Virtual Clusters in a Grid Site Manager,” 
Proceedings of the 12th IEEE International 
Symposium on High Performance Distributed 
Computing (HPDC’03), 2003. 

[13] Cherkasova, Ludmila, Diwaker Gupta, Roman 
Kurakin, Vladimir Dobretsov, and Amin Vahdat, 
“Optimising Grid Site Manager Performance With 
Virtual Machines,” Proceedings of the 3rd USENIX 
Workshop on Real Large Distributed Systems 
(WORLDS), 2006. 

[14] Grit, Laura, David Irwin, Aydan Yumerefendi, 
and Jeff Chase, “Harnessing Virtual Machine Re- 
source Control for Job Management,” Proceed- 
ings of the First International Workshop on Virtu- 
alization Technology in Distributed Computing 
(VTDC), November, 2006. 

[15] Kurakin, Roman, Personal communication, Email 
dated 5/10/2007. 

[16] Merkey, Phil, Beowulf History, 
http://www.beowulf.org/overview/history.html . 

[17] Ruth, P., Junghwan Rhee, Dongyan Xu, R. Ken- 
nell, and S. Goasguen, “Autonomic Live Adapta- 
tion of Virtual Computational Environments in a 
Multi-Domain Infrastructure,” IEEE International 
Conference on Autonomic Computing, June, 
2006. 

[18] Sacerdoti, F. D., S. Chandra, and K. Bhatia, 
“Grid Systems Deployment and Management Using 
Rocks,” IEEE International Conference on 
Cluster Computing, 2004. 

[19] Vogels, Werner and Dan Dumitriu, “An Overview 
of the Galaxy Management Framework for 
Scalable Enterprise Cluster Computing,” Pro- 
ceedings of the IEEE International Conference 
on Cluster Computing, 2000. 

[20] Waldspurger, Carl A., “Memory Resource Management 
in VMware ESX Server,” Proceedings 
of the Fifth Symposium on Operating Systems De- 
sign and Implementation (OSDI’02), December, 
2002. 

[21] White, Brian, Jay Lepreau, Leigh Stoller, Robert 
Ricci, Shashi Guruprasad, Mac Newbold, Mike 
Hibler, Chad Barb, and Abhijeet Joglekar, “An 
Integrated Experimental Environment for Distrib- 
uted Systems and Networks,” Proceedings of the 
Fifth Symposium on Operating Systems Design 
and Implementation, pp. 255-270, USENIX As- 
sociation, Boston, MA, December, 2002. 


[22] Wood, Timothy, Prashant Shenoy, Arun Venkataramani, 
and Mazin Yousif, “Black-box and 
Gray-box Strategies for Virtual Machine Migration,” 
Proceedings of the Fourth Symposium on 
Networked Systems Design and Implementation 
(NSDI), April, 2007. 


Remote Control: Distributed 
Application Configuration, Management, 
and Visualization with Plush 


Jeannie Albrecht — Williams College 
Ryan Braud, Darren Dao, Nikolay Topilski, Christopher Tuttle, Alex C. Snoeren, 
and Amin Vahdat — University of California, San Diego 


ABSTRACT 


Support for distributed application management in large-scale networked environments re- 
mains in its early stages. Although a number of solutions exist for subtasks of application deploy- 
ment, monitoring, maintenance, and visualization in distributed environments, few tools provide a 
unified framework for application management. Many of the existing tools address the manage- 
ment needs of a single type of application or service that runs in a specific environment, and these 
tools are not adaptable enough to be used for other applications or platforms. In this paper, we 
present the design and implementation of Plush, a fully configurable application management in- 
frastructure designed to meet the general requirements of several different classes of distributed 
applications and execution environments. Plush allows developers to specifically define the flow 
of control needed by their computations using application building blocks. Through an extensible 
resource management interface, Plush supports execution in a variety of environments, including 
both live deployment platforms and emulated clusters. To gain an understanding of how Plush 
manages different classes of distributed applications, we take a closer look at specific applications 
and evaluate how Plush provides support for each. 


Introduction 


Managing distributed applications involves deploy- 
ing, configuring, executing, and debugging software run- 
ning on multiple computers simultaneously. Particularly 
for applications running on resources that are spread 
across the wide area, distributed application management 
is a time-consuming and error-prone process. After the 
initial deployment of the software, the applications need 
mechanisms for detecting and recovering from the in- 
evitable failures and problems endemic to distributed en- 
vironments. To achieve availability and reliability, appli- 
cations must be carefully monitored and controlled to en- 
sure continued operation and sustained performance. Op- 
erators in charge of deploying and managing these appli- 
cations face a daunting list of challenges: discovering 
and acquiring appropriate resources for hosting the appli- 
cation, distributing the necessary software, and appropri- 
ately configuring the resources (and re-configuring them 
if operating conditions change). It is not surprising, then, 
that a number of tools have been developed to address 
various aspects of the process in distributed environ- 
ments, but no solution yet flexibly automates the applica- 
tion deployment and management process across all en- 
vironments. 


Presently, most researchers who want to evaluate 
their applications in wide-area distributed environ- 
ments take one of three management approaches. On 
PlanetLab [6, 26], service operators address deploy- 
ment and monitoring in an ad hoc, application-specific 


fashion using customized scripts. Grid researchers, on 
the other hand, leverage one or more toolkits (such as 
the Globus Toolkit [12]) for application development 
and deployment. These toolkits often require tight in- 
tegration with not only the infrastructure, but the ap- 
plication itself. Hence, applications must be custom 
tailored for a given toolkit, and cannot easily be run 
in other environments. Similarly, system administra- 
tors who are responsible for configuring resources in 
machine-room settings often use remote execution 
tools such as cfengine [9] for managing and configur- 
ing networks of machines. As in the other two ap- 
proaches, however, the configuration files are tailored 
to a specific environment and a particular set of re- 
sources, and thus are not easily extended to other plat- 
forms. 


Motivated by the limitations of existing approach- 
es, we believe that a unified set of abstractions for 
achieving availability, scalability, and fault tolerance can 
be applied to a broad range of distributed applications, 
shielding developers from some of the complexities of 
large-scale networked environments. The primary goal 
of our research is to understand these abstractions and 
define interfaces for specifying and managing distrib- 
uted computations run in any execution environment. 
We are not trying to build another toolkit for managing 
distributed applications. Rather, we hope to define the 
way users think about their applications, regardless of 
their target platform. We took inspiration from classical 
operating systems like UNIX [28] which defined the 
standard abstractions for managing applications: files, 
processes, pipes, etc. For most users, communication 
with these abstractions is simplified through the use of a 
shell or command-line interpreter. Of course, distributed 
computations are both more difficult to specify, because 
of heterogeneous hardware and software bases, and 
more difficult to manage, because of failure conditions 
and variable host and network attributes. Further, many 
distributed testbeds do not provide global file system ab- 
stractions, which complicates data management. 


To this end, we present Plush [27], a generic appli- 
cation management infrastructure that provides a unified 
set of abstractions for specifying, deploying, and moni- 
toring distributed applications. Although Plush was ini- 
tially designed to support applications running on Planet- 
Lab [2], Plush now provides extensions that allow users 
to manage distributed applications in a variety of com- 
puting environments. Plush users describe distributed 
computations using an extensible application specifica- 
tion language. In contrast to other application manage- 
ment systems, however, the language allows users to 
customize various aspects of the deployment life cycle to 
fit the needs of an application and its target infrastruc- 
ture. Users can, for example, specify a particular re- 
source discovery service to use during application de- 
ployment. Plush also provides extensive failure manage- 
ment support to automatically adapt to failures in the ap- 
plication and the underlying computational infrastruc- 
ture. Users interact with Plush through a simple com- 
mand-line interface or a graphical user interface (GUI). 
Additionally, Plush exports an XML-RPC interface that 
allows users to programmatically integrate their applica- 
tions with Plush if desired. 


Plush provides abstractions for managing re- 
source discovery and acquisition, software distribu- 
tion, and process execution in a variety of distributed 
environments. Applications are specified using combi- 
nations of Plush application “building blocks” that de- 
fine a custom control flow. Once an application is run- 
ning, Plush monitors it for failures or application-level 
errors for the duration of its execution. Upon detecting 
a problem, Plush performs a number of user-config- 
urable recovery actions, such as restarting the applica- 
tion, automatically reconfiguring it, or even searching 
for alternate resources. For applications requiring wide- 
area synchronization, Plush provides several efficient 
synchronization primitives in the form of partial barri- 
ers, which help applications achieve better performance 
and robustness in failure-prone environments [1]. 
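

To make the idea of a partial barrier concrete, the following minimal sketch shows a barrier that can release before every participant has arrived. It is an illustration only: the class name, the `release_fraction` parameter, and the release rule are assumptions made here, not the primitives defined in [1] or the Plush API. 

```python
import math


class PartialBarrier:
    """Toy barrier that releases once a fraction of participants arrive.

    Hypothetical illustration; the release rule is an assumption.
    """

    def __init__(self, participants, release_fraction=1.0):
        if not 0.0 < release_fraction <= 1.0:
            raise ValueError("release_fraction must be in (0, 1]")
        self.participants = set(participants)
        # Number of arrivals required before the barrier opens.
        self.needed = math.ceil(release_fraction * len(self.participants))
        self.arrived = set()

    @property
    def released(self):
        return len(self.arrived) >= self.needed

    def arrive(self, host):
        """Record that `host` reached the barrier; return True if released."""
        if host not in self.participants:
            raise KeyError(host)
        self.arrived.add(host)
        return self.released


# Four hosts, release once three of them (75%) have arrived.
barrier = PartialBarrier(["h1", "h2", "h3", "h4"], release_fraction=0.75)
results = [barrier.arrive(h) for h in ["h1", "h2", "h3"]]
```

With a fraction below 1.0, a single slow or failed host no longer stalls the remaining participants, which is the robustness property the text attributes to partial barriers. 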


The remainder of this paper discusses the archi- 
tecture of Plush. We motivate the design in the next 
section by enumerating a set of general requirements 
for managing distributed applications. Subsequently, 
we present details about the design and implementa- 
tion of Plush and then provide specific application 
case studies and uses of Plush. We then discuss related 
work and conclude. 


Application Management Requirements 


To better understand the requirements of a dis- 
tributed application management framework, we first 
consider how we might run a specific application in a 
widely-used distributed environment. In particular, we 
investigate the process of running SWORD [23], a 
publicly-available resource discovery service, on Plan- 
etLab. SWORD uses a distributed hash table (DHT) 
for storing data, and aims to run on as many hosts as 
possible, as long as the hosts provide some minimum 
level of availability (since SWORD provides a service 
to other PlanetLab users). Before starting SWORD, 
we have to find and gain access to PlanetLab ma- 
chines capable of hosting the service. Since SWORD 
is most concerned with reliability, it does not necessar- 
ily need powerful machines, but it must avoid nodes 
that frequently perform poorly over a relatively long 
time. We locate reliable machines using a tool like 
CoMon [25], which monitors resource usage on Plan- 
etLab, and then we install the SWORD software on 
those machines. 


This software installation involves downloading 
the SWORD software package on each host individu- 
ally, unpacking the software, and installing any soft- 
ware dependencies, including a Java Runtime Envi- 
ronment. After the software has been installed on all 
of the selected machines, we start the SWORD execu- 
tion. Recall that reliability is important to SWORD, so 
if an error or failure occurs at any point, we need to 
quickly detect it (perhaps using custom scripts and 
cron jobs) and restore the service to maintain high 
availability. 


Running SWORD on PlanetLab is an example of 
a specific distributed application deployment. The 
low-level details of managing distributed applications in 
general largely depend on the characteristics of the tar- 
get application and environment. For example, long- 
running services such as SWORD prefer reliable ma- 
chines and attempt to dynamically recover from failures 
to ensure high availability. On the other hand, short- 
lived scientific parallel applications (e.g., EMAN [18]) 
prefer powerful machines with high bandwidth/low la- 
tency network connections. Long term reliability is not a 
huge concern for these applications, since they have 
short execution times. At a high level, however, if we 
ignore the complexities associated with resource man- 
agement, the requirements for managing distributed ap- 
plications are largely similar for all applications and en- 
vironments. Rather than reinvent the same infrastructure 
for each class separately, our goal is to identify common 
abstractions that support the execution of many types of 
distributed applications, and to build an application- 
management infrastructure that supports the general re- 
quirements of all applications. In this section, we identi- 
fy these general requirements for distributed application 
management. 


Specification. A generic application controller 
must allow application operators to customize the 
control flow for each application. This specification is 
an abstraction that describes distributed computations. 
A specification identifies all aspects of the execution 
and environment needed to successfully deploy, man- 
age, and maintain an application, including the soft- 
ware required to run the application, the processes that 
will run on each machine, the resources required to 
achieve the desired performance, and any environ- 
ment-specific execution parameters. User credentials 
for resources must also be included in the application 
specification in order to obtain access to resources. To 
manage complex multi-phased computations, such as 
scientific parallel applications, the specification must 
support application synchronization requirements. Simi- 
larly, distributing computations among pools of ma- 
chines requires a way to specify a workflow — a collec- 
tion of tasks that must be completed in a given order — 
within an application specification. 


The complexity of distributed applications varies 
greatly from simple, single-process applications to 
elaborate, parallel applications. Thus the challenge is 
to define a specification language abstraction that pro- 
vides enough expressibility for complex distributed 
applications, but is not too complicated for single- 
process computations. In short, the language must be 
simple enough for novice application developers to 
understand, yet expose enough advanced functionality 
to run complex scenarios. 


Resource Discovery and Acquisition. Another 
key abstraction in distributed applications is the 
resource. Put simply, resources are computing devices 
that are connected to a network and are capable of 
hosting an application. Because resources in distrib- 
uted environments are often heterogeneous, applica- 
tion developers naturally want to find the resource set 
that best satisfies the demands of their application. 
Even if hardware is largely homogeneous, dynamic re- 
source characteristics such as available bandwidth or 
CPU load can vary over time. The goal of resource 
discovery is to find the best current set of resources 
for the distributed application as described in the spec- 
ification. In environments that support dynamic virtual 
machine instantiation [5, 30], these resources may not 
exist in advance. Thus, resource discovery in this case 
involves finding the appropriate physical machines to 
host the virtual machine configurations. 
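

Resource discovery, as described above, can be viewed as filtering candidate hosts against the constraints in the specification and ranking the survivors. The sketch below assumes a hypothetical `Host` record and a simple "least loaded first" ranking; real discovery services use their own query languages and metrics. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    # Hypothetical snapshot of dynamic resource characteristics.
    name: str
    cpu_load: float      # load average; lower is better
    free_mem_mb: int


def discover(hosts, count, max_load, min_free_mem_mb):
    """Return up to `count` hosts meeting the constraints, least loaded first."""
    eligible = [
        h for h in hosts
        if h.cpu_load <= max_load and h.free_mem_mb >= min_free_mem_mb
    ]
    # Break ties on name so that the result is deterministic.
    eligible.sort(key=lambda h: (h.cpu_load, h.name))
    return eligible[:count]


candidates = [
    Host("node-a", 0.4, 2048),
    Host("node-b", 3.1, 4096),   # rejected: load too high
    Host("node-c", 0.9, 512),    # rejected: too little free memory
    Host("node-d", 0.2, 1024),
]
chosen = discover(candidates, count=2, max_load=1.0, min_free_mem_mb=1024)
```

Because load and free memory change over time, a query like this would be re-run whenever the application needs replacement resources. 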


Resource discovery systems often interact directly 
with resource acquisition systems. Resource acquisition 
involves obtaining a lease or permission to use the de- 
sired resources. Depending on the execution environ- 
ment, acquisition can take a number of forms. For ex- 
ample, to support advanced resource reservations as in 
a batch pool, resource acquisition is responsible for 
submitting a resource request and subsequently obtain- 
ing a lease from the scheduler. In virtual machine envi- 
ronments, resource acquisition may involve instantiat- 
ing virtual machines, verifying their successful creation, 
and gathering the appropriate information (e.g., IP 
address, authentication keys) required for access. The 
challenge facing an application management framework 
is to provide a generic resource-management interface. 
Ultimately, the complexities associated with creating and 
gaining access to physical or virtual resources should be 
hidden from the application developer. 
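

One way to hide these differences is an abstract interface that each environment implements. The sketch below is an assumption about how such an interface could look; `ResourceManager`, `StaticPool`, and `run_on` are hypothetical names introduced for illustration and do not come from Plush. 

```python
from abc import ABC, abstractmethod


class ResourceManager(ABC):
    """Hypothetical generic resource-management interface."""

    @abstractmethod
    def acquire(self, count):
        """Obtain access to `count` resources and return them."""

    @abstractmethod
    def release(self, resources):
        """Give resources back to the underlying system."""


class StaticPool(ResourceManager):
    """Simplest backend: a fixed list of machines, no scheduler involved."""

    def __init__(self, hosts):
        self._free = list(hosts)
        self._leased = set()

    def acquire(self, count):
        if count > len(self._free):
            raise RuntimeError("not enough free hosts")
        granted, self._free = self._free[:count], self._free[count:]
        self._leased.update(granted)
        return granted

    def release(self, resources):
        for host in resources:
            if host in self._leased:
                self._leased.remove(host)
                self._free.append(host)


def run_on(manager, count):
    """Controller-side code sees only the ResourceManager interface."""
    hosts = manager.acquire(count)
    try:
        return [f"started on {h}" for h in hosts]
    finally:
        manager.release(hosts)


pool = StaticPool(["vm1", "vm2", "vm3"])
log = run_on(pool, 2)
```

A batch scheduler or a virtual machine manager would supply its own subclass (submitting a request, or instantiating VMs, inside `acquire`) without any change to `run_on`. 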


Deployment. Upon obtaining an appropriate set 
of resources, the application-deployment abstraction 
defines the steps required to prepare the resources 
with the correct software and data files, and run any 
necessary executables to start the application. This in- 
volves copying, unpacking, and installing the software 
on the target hosts. The application controller must 
support a variety of different file-transfer mechanisms 
for each environment, and should react to failures that 
occur during the transfer of software or in starting exe- 
cutables. 


One important aspect of application deployment 
is configuring the requested number of resources with 
compatible versions of the software. Ensuring that a 
minimum number of resources are available and cor- 
rectly configured for a computation may involve re- 
questing new resources from the resource discovery 
and acquisition systems to compensate for failures that 
occur at startup. Further, many applications require 
some form of synchronization across hosts to guaran- 
tee that various phases of computation start at approxi- 
mately the same time. Thus, the application controller 
must provide mechanisms for loose synchronization. 
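

The interaction between deployment and resource acquisition can be sketched as a loop that keeps installing until the required number of hosts is ready, drawing replacements when an installation fails. The function below is a simplified illustration under stated assumptions: `install` is a caller-supplied callable that returns True on success, and `spares` stands in for whatever the discovery and acquisition systems would return. 

```python
def deploy(hosts, spares, install, required):
    """Install on hosts, pulling from spares until `required` succeed.

    Returns (ready, failed). Raises RuntimeError if the spares run out.
    """
    queue, spares = list(hosts), list(spares)
    ready, failed = [], []
    while len(ready) < required:
        if not queue:
            if not spares:
                raise RuntimeError(
                    f"only {len(ready)} of {required} hosts configured"
                )
            # Compensate for an earlier failure with a replacement host.
            queue.append(spares.pop(0))
        host = queue.pop(0)
        if install(host):
            ready.append(host)
        else:
            failed.append(host)
    return ready, failed


def fake_install(host):
    # Stand-in for copying, unpacking, and installing software.
    return host != "bad"


ready, failed = deploy(["a", "bad", "c"], ["d", "e"], fake_install, required=3)
```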


Maintenance. Perhaps the most difficult require- 
ment for managing distributed applications is monitor- 
ing and maintaining an application after execution be- 
gins. Thus, another abstraction that the application 
controller must define is support for customizable ap- 
plication maintenance. One key aspect of maintenance 
is application and resource monitoring, which involves 
probing hosts for failure due to network outages or 
hardware malfunctions, and querying applications for 
indications of failure (often requiring hooks into appli- 
cation-specific code for observing the progress of an 
execution). Such monitoring allows for more specific 
error reporting and simplifies the debugging process. 


In some cases, system failures may result in a sit- 
uation where application requirements can no longer 
be met. For example, if an application is initially con- 
figured to be deployed on 50 resources, but only 48 
can be contacted at a certain point in time, the applica- 
tion controller should adapt the application, if possi- 
ble, and continue executing with only 48 machines. 
Similarly, different applications have different policies 
and requirements with respect to failure recovery. 
Some applications may be able to simply restart a 
failed process on a single host, while others may re- 
quire the entire execution to abort in the case of fail- 
ure. Thus, in addition to the other features previously 
described, the application controller should support a 
variety of options for failure recovery. 
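

The recovery options described above can be sketched as a small policy dispatch. Only the name "restart-on-failure" appears in this paper; the other two policy names, the function, and its return values are hypothetical and chosen for illustration. 

```python
def handle_failure(policy, hosts, failed_host, min_hosts):
    """Return (action, hosts_to_use) after `failed_host` fails."""
    if policy == "restart-on-failure":
        # Restart only the failed process; the host set is unchanged.
        return ("restart", list(hosts))
    if policy == "continue-without":      # hypothetical policy name
        survivors = [h for h in hosts if h != failed_host]
        if len(survivors) >= min_hosts:
            return ("continue", survivors)
        return ("abort", [])
    if policy == "abort-on-failure":      # hypothetical policy name
        return ("abort", [])
    raise ValueError(f"unknown policy: {policy}")


# An application configured for 50 hosts that can tolerate dropping to 48.
hosts = [f"node{i:02d}" for i in range(50)]
action1, left1 = handle_failure("continue-without", hosts, "node07", 48)
action2, left2 = handle_failure("continue-without", left1, "node31", 48)
action3, left3 = handle_failure("continue-without", left2, "node00", 48)
```

The first two failures leave 49 and then 48 usable hosts, so execution continues; a third failure would drop below the stated minimum and the sketch aborts. 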


Plush: Design and Implementation 


We now describe Plush, an extensible distributed 
application controller, designed to address the require- 
ments of large-scale distributed application manage- 
ment discussed in the second section. 


To directly monitor and control distributed appli- 
cations, Plush itself must be distributed. Plush uses a 
client-server architecture, with clients running on each 
resource (e.g., machine) involved in the application. 
The Plush server, called the controller, interprets input 
from the user (i.e., the person running the application) 
and sends messages on behalf of the user over an over- 
lay network (typically a tree) to Plush clients. The 
controller, typically run from the user’s workstation, 
directs the flow of control throughout the life of the 
distributed application. The clients run alongside each 
application component across the network and per- 
form actions based upon instructions received from 
the controller. 
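

As a rough illustration of message delivery over a tree overlay, the sketch below arranges clients in a fixed-fanout tree rooted at the controller and forwards a message level by level. The construction rule, the fanout, and the function names are assumptions; the paper does not specify here how Plush builds its tree. Host names are assumed to be unique. 

```python
def build_tree(controller, clients, fanout=2):
    """Return {node: [children]} for a fanout-ary tree rooted at controller."""
    nodes = [controller] + list(clients)
    children = {n: [] for n in nodes}
    for i, node in enumerate(nodes[1:], start=1):
        parent = nodes[(i - 1) // fanout]
        children[parent].append(node)
    return children


def broadcast(children, root, message):
    """Forward `message` down the tree; return (host, message) deliveries."""
    delivered = []
    frontier = [root]
    while frontier:
        node = frontier.pop(0)
        for child in children[node]:
            delivered.append((child, message))
            frontier.append(child)
    return delivered


tree = build_tree("controller", ["c1", "c2", "c3", "c4", "c5"], fanout=2)
order = [host for host, _ in broadcast(tree, "controller", "start")]
```

With a tree, the controller contacts only its direct children and each client relays to its own children, so no single node has to open a connection to every host. 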


Figure 1a shows an overview of the Plush controller 
architecture. (The client architecture is symmetric 
to the controller with only minor differences in 
functionality.) The architecture consists of three main 
sub-systems: the application specification, core func- 
tional units, and user interface. Plush parses the appli- 
cation specification provided by the user and stores 


User Interface 


Application Block 





Figure la: The architecture of Plush. The user interface is shown above the rest of the architecture and contains 
methods for interacting with all boxes in the lower sub-systems of Plush. Boxes below the user interface and 
above the dotted line indicate objects defined within the application specification abstraction. Boxes below the 


line represent the core functional units of Plush. 





Component Block 1 
Senders 


Process Block 1 /app/comp1/proc1 


prepare_files.pl 


Process Block 2 /app/comp1/proc2 
join_overlay.pl 


Barrier Block 1 = = =—/app/comp1/barr1 ~ 


Process Block 3 /app/comp1/proc3 
send_files.pl 





ceatiz i eateries 


iat) 





STARS; a eee teri hy ei aT ESET EST 


Component Block 2 
Receivers 


Process Block 1 /app/comp2/proc1 
join_overlay.pl 


Process Block 2 /app/comp2/proc2 
receive_files.pl 


—_—_—— FEET Ee pee eae Fao NY TST 
EREPLPTADS EAP Et Fe taunt d Seek een rar Fi She ere bees vals tercr ee eee 


Figure 1b: Example file-distribution application comprised of application, component, process, and barrier blocks in 
Plush. Arrows indicate control-flow dependencies (i.e., X —- Y implies that X must complete before Y starts). 
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internal data structures and objects specific to the ap- 
plication being run. The core functional units then ma- 
nipulate and act on the objects defined by the applica- 
tion specification to run the application. The function- 
al units also store authentication information, monitor 
physical machines, handle event and timer actions, 
and maintain the communication infrastructure that 
enables the controller to query the status of the distrib- 
uted application on the clients. The user interface pro- 
vides the functionality needed to interact with the oth- 
er parts of the architecture, allowing the user to main- 
tain and manipulate the application during execution. 
In this section, we describe the design and implemen- 
tation details of each of the Plush sub-systems.1 


Application Specification 


Developing a complete, yet accessible, applica- 
tion specification language was one of the principal 
challenges in this work. Our approach, which has 
evolved over the past three years, consists of combina- 
tions of five different abstractions: 

1. Process blocks — Describe the processes exe- 
cuted on each machine involved in an applica- 
tion. The process abstraction includes runtime 
parameters, path variables, runtime environ- 
ment details, file and process I/O information, 
and the specific commands needed to start a 
process on a remote machine. 

2. Barrier blocks — Describe the barriers that are 
used to synchronize the various phases of exe- 
cution within a distributed application. 

3. Workflow blocks — Describe the flow of data 
in a distributed computation, including how the 
data should be processed. Workflow blocks 
may contain process and barrier blocks. For ex- 
ample, a workflow block might describe a set 
of input files over which a process or barrier 
block will iterate during execution. 

4. Component blocks — Describe the groups of re- 
sources required to run the application. This in- 
cludes expectations specific to a set of metrics 
for the target resources. In the case of compute 
nodes in a cluster, for example, these metrics 
might include maximum load requirements and 
minimum free memory requirements. Compo- 
nents also define required software configura- 
tions, installation instructions, and any authenti- 
cation information needed to access the re- 
sources. Component blocks may contain work- 
flow blocks, process blocks, and barrier blocks. 

5. Application blocks — Describe high-level in- 
formation about a distributed application. This 
includes one or many component blocks, as 
well as attributes to help automate failure re- 
covery. 
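

The nesting rules in the list above can be written down as a small set of data types. The sketch below is not the Plush specification language; the class and field names are hypothetical. It encodes the file-distribution example of Figure 1b, and the receiver-side barrier path `/app/comp2/barr1` is an assumption, since the figure text for that block did not survive. 

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ProcessBlock:
    path: str
    command: str


@dataclass
class BarrierBlock:
    path: str
    barrier_name: str


@dataclass
class WorkflowBlock:
    path: str
    inputs: List[str]
    steps: List[Union[ProcessBlock, BarrierBlock]] = field(default_factory=list)


@dataclass
class ComponentBlock:
    name: str
    steps: List[Union[WorkflowBlock, ProcessBlock, BarrierBlock]]


@dataclass
class ApplicationBlock:
    name: str
    on_failure: str
    components: List[ComponentBlock]


senders = ComponentBlock("Senders", [
    ProcessBlock("/app/comp1/proc1", "prepare_files.pl"),
    ProcessBlock("/app/comp1/proc2", "join_overlay.pl"),
    BarrierBlock("/app/comp1/barr1", "bootstrap"),
    ProcessBlock("/app/comp1/proc3", "send_files.pl"),
])
receivers = ComponentBlock("Receivers", [
    ProcessBlock("/app/comp2/proc1", "join_overlay.pl"),
    BarrierBlock("/app/comp2/barr1", "bootstrap"),  # path assumed
    ProcessBlock("/app/comp2/proc2", "receive_files.pl"),
])
app = ApplicationBlock("file-distribution", "restart-on-failure",
                       [senders, receivers])


def commands(component):
    """List the process commands of a component in control-flow order."""
    return [s.command for s in component.steps if isinstance(s, ProcessBlock)]
```

Both components name the same barrier, which is how the two otherwise independent control flows are tied together. 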


1Note that the components within the sub-systems are high- 


To better illustrate the use of these blocks in 
Plush, consider building the specification for a simple 
file-distribution application as shown in Figure 1b. 
This application consists of two groups of machines. 
One group, the senders, stores the files, and the second 
group, the receivers, attempts to retrieve the files from 
the senders. The goal of the application is to experi- 
ment with the use of an overlay network to send files 
from the senders to the receivers using some new file- 
distribution protocol. In this example, there are two 
phases of execution. In the first phase, all senders and 
receivers join the overlay before any transfers begin, 
and the senders must prepare the files for transfer be- 
fore the receivers start receiving files. In the second 
phase, the receivers begin receiving the files. No new 
senders or receivers are allowed to join the network 
during the second phase. 


The first step in building the corresponding Plush 
application specification for our new file-distribution 
protocol is to define an application block. The applica- 
tion block defines general characteristics about the ap- 
plication including liveness properties and failure de- 
tection and recovery options, which determine default 
failure recovery behavior. For this example, we choose 
the behavior “restart-on-failure,” which attempts to 
restart the failed application instance on a single host, 
since it is not necessary to abort the entire application 
across all hosts if only a single failure occurs. 


The application block also contains one or many 
component blocks that describe the groups of re- 
sources (i.e., machines) required to run the applica- 
tion. Our application consists of a set of senders and a 
set of receivers, and two separate component blocks 
describe the two groups of machines. The sender com- 
ponent block defines the location and installation in- 
structions for the sender software, and includes au- 
thentication information to access the resources. Simi- 
larly, the receiver component block defines the receiv- 
er software package. In our example, it may be desir- 
able to require that all machines in the sender group 
have a processor speed of at least 1 GHz, and each 
sender should have sufficient bandwidth for sending 
files to multiple receivers at once. These types of ma- 
chine-specific requirements are included in the com- 
ponent blocks. Within each component block, a com- 
bination of workflow, process, and barrier blocks 
describe the distributed computation.2 


Plush process blocks describe the specific com- 
mands required to execute the application. Most process 
blocks depend on the successful installation of software 
packages defined in the component blocks. Users speci- 
fy the commands required to start a given process, and 
actions to take upon process exit. The exit policies cre- 
ate a Plush process monitor that oversees the execution 
of a specific process. Our example has several process 
blocks. In the sender component, process blocks define
processes for preparing the files, joining the overlay, and
sending the files. Similarly, the receiver component
contains process blocks for joining the overlay and
receiving the files.

² Although our example does not use workflow blocks, they
are used in applications where data files must be distributed
and iteratively processed.


Some applications operate in phases, producing 
output files in early stages that are used as input files 
in later stages. To ensure all hosts start each phase of 
computation only after the previous phase completes, 
barrier blocks define loose synchronization semantics 
between process and workflow blocks. In our exam- 
ple, a barrier ensures that all receivers and senders join 
the overlay in phase one before beginning the file 
transfer in phase two. Note that although each barrier 
block is uniquely defined within a component block, it 
is possible for the same barrier to be referenced in 
multiple component blocks. We use barrier blocks in 
our example within each component block that refer to 
the same barrier, which means that the application will 
wait for all receivers and senders to reach the barrier 
before allowing either component to start sending or 
receiving files. 


In Figure 1b, the outer application block contains 
our two component blocks that run in parallel (since 
there are no arrows indicating control-flow dependen- 
cies between them). Within the component blocks, the 
different phases are separated by the bootstrap barrier 
that is defined by barrier block 1. Component block 1, 
which describes the senders, contains process blocks 1 
and 2 that define Perl scripts that run in parallel during
phase one, synchronize on the barrier in barrier block 
1, and then proceed to process block 3 in phase two 
which sends the files. Component block 2, which
describes the receivers, runs process block 1 in phase
one, synchronizes on the barrier in barrier block 1, and 
then proceeds to process block 2 in phase two which 
runs the process that receives the files. In our imple- 
mentation, the blocks are represented by XML that is 
parsed by the Plush controller when the application is 
run. We show an example of the XML later. 


We designed the Plush application specification 
to support a variety of execution patterns. With the 
blocks described above, Plush supports the arbitrary 
combination of processes, barriers, and workflows, 
provided that the flow of control between them forms 
a directed acyclic graph. Using predecessor tags in 
Plush, users specify the flow of control and define 
whether processes run in parallel or sequentially. Ar- 
rows between blocks in Figure 1b, for example, indi- 
cate the predecessor dependencies. (Process blocks 1 
and 2 in component block 1 will run in parallel before 
blocking at the bootstrap barrier, and then the execu- 
tion will continue on to process block 3 after the boot- 
strap barrier releases.) Internally, Plush stores the 
blocks in a hierarchical data structure, and references 
specific blocks in a manner similar to referencing ab- 
solute paths in a UNIX file system. Figure 1b shows 
the unique path names for each block from our file- 
distribution example. This naming abstraction also 
simplifies coordination among remote hosts. Each 
Plush client maintains an identical local copy of the
application specification. Thus, for communication re- 
garding control flow changes, the controller sends the 
clients messages indicating which “block” is current- 
ly being executed, and the clients update their local 
state information accordingly. 
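
The block hierarchy, path naming, and predecessor ordering described above can be illustrated with a minimal Python sketch. It is not Plush's implementation: the `Block` class, the `predecessors` field, and the barrier name `barr1` are assumptions made for illustration; only the path form `/app/comp1/proc1` comes from the text.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Block:
    name: str
    kind: str  # "application", "component", "process", or "barrier"
    predecessors: list = field(default_factory=list)  # sibling names that must finish first
    children: list = field(default_factory=list)


def paths(block, prefix=""):
    """Yield a UNIX-style absolute path for every block in the tree."""
    path = f"{prefix}/{block.name}"
    yield path
    for child in block.children:
        yield from paths(child, path)


def execution_order(component):
    """Order a component's children so that every predecessor comes first.

    TopologicalSorter raises CycleError if the dependencies are not a DAG.
    """
    graph = {c.name: set(c.predecessors) for c in component.children}
    return list(TopologicalSorter(graph).static_order())


# Sender component from the file-distribution example (names are hypothetical).
senders = Block("comp1", "component", children=[
    Block("proc1", "process"),
    Block("proc2", "process"),
    Block("barr1", "barrier", predecessors=["proc1", "proc2"]),
    Block("proc3", "process", predecessors=["barr1"]),
])
app = Block("app", "application", children=[senders])
```

Because every host holds the same tree, a short path string is enough for the controller to tell a client which block is currently executing.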


Core Functional Units 


After parsing the block abstractions defined by 
the user within the application specification, Plush in- 
stantiates a set of core functional units to perform the 
operations required to configure and deploy the
distributed application. Figure 1a shows these units as
shaded boxes below the dotted line. The functional 
units manipulate the objects defined in the application 
specification to manage distributed applications. In 
this section, we describe the role of each of these 
units. 


Starting at the highest level, the Plush resource 
discovery and acquisition unit uses the resource re- 
quirements in the component blocks to locate and cre- 
ate (if necessary) resources on behalf of the user. The 
resource discovery and acquisition unit is responsible 
for obtaining a valid set, called a matching, of re- 
sources that meet the application’s demands. To deter- 
mine this matching, Plush may either call an existing 
external service to construct a resource pool, such as 
SWORD or CoMon for PlanetLab, or use a statically 
defined resource pool based on information provided 
by the user. The Plush resource matcher then uses the 
resources in the resource pool to create a matching for 
the application. All hosts involved in an application 
run a Plush host monitor that periodically publishes 
information about the host. The resource discovery 
and acquisition unit may use this information to help 
find the best matching. Upon acquiring a resource, a 
Plush resource manager stores the lease, token, or 
any necessary user credential needed for accessing 
that resource to allow Plush to perform actions on be- 
half of the user in the future. 
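
As a rough illustration of what a matching is, the following Python sketch filters a statically defined resource pool by minimum requirements. The function name, the dictionary keys, and the host names are hypothetical; the actual matcher may weigh many more attributes, including the data published by the host monitors.

```python
def match(pool, requirements, count):
    """Return `count` hosts that meet every minimum in `requirements`, or None."""
    eligible = [
        host for host in pool
        if all(host.get(key, 0) >= minimum for key, minimum in requirements.items())
    ]
    return eligible[:count] if len(eligible) >= count else None


# Hypothetical pool; the 1 GHz requirement mirrors the sender example above.
pool = [
    {"name": "hostA", "cpu_ghz": 2.0},
    {"name": "hostB", "cpu_ghz": 0.8},
    {"name": "hostC", "cpu_ghz": 1.2},
]
matching = match(pool, {"cpu_ghz": 1.0}, count=2)
```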


The remaining functional units in Figure 1a are
responsible for application deployment and mainte- 
nance. These units connect to resources, install re- 
quired software, start the execution, and monitor the 
execution for failures. One important functional unit 
used for these operations is the Plush barrier manag- 
er, which provides advanced synchronization services 
for Plush and the application itself. In our experience, 
traditional barriers [17] are not well suited for volatile, 
wide-area network conditions; the semantics are sim- 
ply too strict. Instead, Plush uses partial barriers, 
which are designed to perform better in volatile envi- 
ronments [1]. Partial barriers ensure that the execution 
makes forward progress in the face of failures, and im- 
prove performance in failure-prone environments us- 
ing relaxed synchronization semantics. 
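
To make the idea of relaxed semantics concrete, here is a minimal Python sketch of one possible relaxation: the barrier releases once a chosen fraction of the live participants has arrived, and failed hosts can be removed so that they no longer hold it. This is an assumption for illustration only; the precise semantics of partial barriers are defined in [1], and the class and parameter names are hypothetical.

```python
class PartialBarrier:
    def __init__(self, participants, release_fraction=1.0):
        self.participants = set(participants)
        self.arrived = set()
        self.release_fraction = release_fraction  # 1.0 behaves like a strict barrier

    def arrive(self, host):
        if host in self.participants:
            self.arrived.add(host)

    def remove(self, host):
        """Drop a failed host so that it no longer blocks the others."""
        self.participants.discard(host)
        self.arrived.discard(host)

    def released(self):
        if not self.participants:
            return True
        return len(self.arrived) >= self.release_fraction * len(self.participants)
```

With `release_fraction=1.0` and no removals this degenerates to a traditional barrier, which shows why a single failed host can stall the strict version indefinitely.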

The Plush file manager handles all files required 
by a distributed application. This unit contains informa- 
tion regarding software packages, file transfer methods, 
installation instructions, and workflow data files. The
file manager is responsible for preparing the physical re- 
sources for execution using the information provided by 
the application specification. It monitors the status of 
file transfers and installations, and if it detects an error 
or failure, the controller is notified and the resource dis- 
covery and acquisition unit may be required to find a 
new host to replace the failed one. 


Once the resources are prepared with the neces- 
sary software, the application deployment phase com- 
pletes by starting the execution. This is accomplished 
by starting a number of processes on remote hosts. 
Plush processes are defined within process blocks in 
the application specification. A Plush process is an ab- 
straction for standard UNIX processes that run on 
multiple hosts. Processes require information about 
the runtime environment needed for an execution in- 
cluding the working directory, path, environment vari- 
ables, file I/O, and the command line arguments. 


The two lowest layers of the Plush architecture 
consist of a communication fabric and the I/O and 
timer subsystems. The communication fabric handles 
passing and receiving messages among Plush overlay 
participants. Participants communicate over TCP con- 
nections. The default topology for a Plush overlay is a 
star, although we also provide support for tree topolo- 
gies for increased scalability (discussed later in detail). 
In the case of a star topology, all clients connect direct- 
ly to the controller, which allows for quick failure de- 
tection and recovery. The controller sends messages to 
the clients instructing them to perform certain actions. 
When the clients complete their tasks, they report back 
to the controller for further direction. The communica- 
tion fabric at the controller knows what hosts are in- 
volved in a particular application instance, so that the 
appropriate messages reach all necessary hosts. 


At the bottom of all of the other units is the Plush 
I/O and timer abstraction. As messages are received in 
the communication fabric, message handlers fire events. 
These events are sent to the I/O and timer layer and en- 
ter a queue. The event loop pulls events off the queue, 
and calls the appropriate event handler. Timers are a 
special type of event in Plush that fire at a predefined in- 
stant. 
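
The queue-and-dispatch structure described above can be sketched in a few lines of Python. This is an illustrative model, not Plush code: the class and method names are hypothetical, and the current time is passed in explicitly so that the behavior is deterministic.

```python
import heapq
from collections import deque


class EventLoop:
    def __init__(self):
        self.queue = deque()   # (event_name, payload) pairs awaiting dispatch
        self.timers = []       # min-heap of (fire_time, seq, event_name, payload)
        self.handlers = {}     # event_name -> handler callable
        self._seq = 0          # tie-breaker so payloads are never compared

    def on(self, name, handler):
        self.handlers[name] = handler

    def post(self, name, payload=None):
        """Called by a message handler when a message arrives."""
        self.queue.append((name, payload))

    def add_timer(self, fire_time, name, payload=None):
        heapq.heappush(self.timers, (fire_time, self._seq, name, payload))
        self._seq += 1

    def run_until(self, now):
        """Enqueue timers due at or before `now`, then drain the queue."""
        while self.timers and self.timers[0][0] <= now:
            _, _, name, payload = heapq.heappop(self.timers)
            self.post(name, payload)
        while self.queue:
            name, payload = self.queue.popleft()
            self.handlers[name](payload)
```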


Fault Tolerance and Scalability 


Two of the biggest challenges that we encountered
during the design of Plush were being robust to
failures and scaling to hundreds of machines spread
across the wide area. In this section we explore fault
tolerance and scalability in Plush.

Fault Tolerance 

Plush must be robust to the variety of failures 
that occur during application execution. When design- 
ing Plush, we aimed to provide the functionality need- 
ed to detect and recover from most failures without 
ever needing to involve the user running the applica- 
tion. Rather than enumerate all possible failures that 
may occur, we will discuss how Plush handles three
common failure classes — process, host, and controller 
failures. 


Process failures. When a remote host starts a 
process defined in a process block, Plush attaches a 
process monitor to the process. The role of the process 
monitor is to catch any signals raised by the process, 
and to react appropriately. When a process exits either 
due to successful completion or error, the process 
monitor sends a message to the controller indicating 
that the process has exited, and includes its exit status. 
Plush defines a default set of behaviors that occur in 
response to a variety of exit codes (although these can 
be overridden within an application specification). The 
default behaviors include ignoring the failure, restart- 
ing only the failed process, restarting the application, 
or aborting the entire application. 
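
A minimal Python sketch of an exit policy follows. The four actions are the ones listed above; the specific mapping of exit codes to actions, and the names `DEFAULT_POLICY`, `FALLBACK`, and `on_exit`, are assumptions for illustration, since the text does not give Plush's actual defaults.

```python
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"
    RESTART_PROCESS = "restart-process"
    RESTART_APP = "restart-application"
    ABORT_APP = "abort-application"


# Hypothetical defaults: a clean exit needs no recovery, and any other
# exit code restarts only the failed process.
DEFAULT_POLICY = {0: Action.IGNORE}
FALLBACK = Action.RESTART_PROCESS


def on_exit(exit_status, overrides=None):
    """Pick the recovery action; `overrides` models the application specification."""
    policy = {**DEFAULT_POLICY, **(overrides or {})}
    return policy.get(exit_status, FALLBACK)
```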


In addition to handling process failures, Plush also allows
users to monitor the status of a process that is still run- 
ning through a specific type of process monitor called 
a liveness monitor, whose goal is to detect misbehav- 
ing and unresponsive processes that get stuck in loops 
and never exit. This is especially useful in the case of 
long-running services that are not closely monitored 
by the user. To use the liveness monitor, the user spec- 
ifies a script and a time interval in the process block of 
the application specification. The liveness monitor 
wakes up once per time interval and runs the script to 
test for the liveness of the application, returning either 
success or failure. If the test fails, the Plush client kills 
the process, causing the process monitor to be alerted 
and inform the controller. 
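
One wake-up of such a liveness monitor might look like the following Python sketch. It assumes that the user-supplied script signals success with exit status 0; the function names, the timeout, and the `kill` callback are hypothetical stand-ins for the client's internals.

```python
import subprocess
import sys


def is_alive(check_cmd, timeout_s=10):
    """Run the liveness command; exit status 0 is taken to mean success."""
    try:
        return subprocess.run(check_cmd, timeout=timeout_s).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def liveness_tick(check_cmd, kill):
    """Test liveness once and kill the monitored process if the test fails."""
    alive = is_alive(check_cmd)
    if not alive:
        kill()
    return alive


# Stand-in liveness scripts that simply exit with a fixed status.
PASSING = [sys.executable, "-c", "raise SystemExit(0)"]
FAILING = [sys.executable, "-c", "raise SystemExit(3)"]
```

A real monitor would call `liveness_tick` once per configured interval. Killing the process reuses the existing process monitor path, so no separate failure channel to the controller is needed.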


Remote host failures. Detecting and reacting to 
process failures is straightforward since the controller 
is able to communicate information to the client re- 
garding the appropriate recovery action. When a host 
fails, however, recovering is more difficult. A host 
may fail for a number of reasons, including network 
outages, hardware problems, and power loss. Under 
all of these conditions, the goal of Plush is to quickly 
detect the problem and reconfigure the application 
with a new set of resources to continue execution. The 
Plush controller maintains a list of the last time suc- 
cessful communication occurred with each connected 
client. If the controller does not hear from a client 
within a specified time interval, the controller sends a 
ping to the client. If the controller does not receive a 
response from the client, we assume host failure. Reli- 
able failure detection is an active area of research; 
while the simple technique we employ has been suffi- 
cient thus far, we intend to leverage advances in this 
space where appropriate. 
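
The detection scheme just described (a table of last-contact times plus a ping for quiet clients) can be sketched in Python as follows. The class name, method names, and the `ping` callback are hypothetical; time is passed in explicitly to keep the example deterministic.

```python
class FailureDetector:
    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.last_heard = {}  # host -> time of last successful communication

    def heard_from(self, host, now):
        self.last_heard[host] = now

    def quiet_hosts(self, now):
        """Hosts that have been silent for longer than the interval."""
        return [h for h, t in self.last_heard.items() if now - t > self.interval_s]

    def check(self, now, ping):
        """Ping every quiet host; return the hosts that did not respond."""
        failed = []
        for host in self.quiet_hosts(now):
            if ping(host):
                self.heard_from(host, now)
            else:
                failed.append(host)  # no response: assume host failure
        return failed
```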


There are three possible actions in response to a 
host failure: restart, rematch, and abort. By default, the 
controller tries all three actions in order. The first and 
easiest way to recover from a host failure is to simply 
reconnect and restart the application on the failed host. 
This technique works if the host experiences a temporary
power or network outage, and is only unreachable
for a short period of time. If the controller is unable to 
reconnect to the host, the next option is to rematch in an 
attempt to replace the failed host with a different host. 
In this case, Plush reruns the resource matcher to find a 
new machine. Depending on the application, the entire 
execution may need to be restarted across all hosts after 
the new host joins the control overlay, or the execution 
may only need to be started on the new host. If the con- 
troller is unable to find a new host to replace the failed 
host, Plush aborts the entire application. 
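
The default escalation (restart, then rematch, then abort) reduces to a short decision function. The Python sketch below is illustrative only; `reconnect` and `rematch` are hypothetical callbacks standing in for the controller's reconnection logic and the resource matcher.

```python
def recover(host, reconnect, rematch):
    """Try each recovery action in order and report the one that applied."""
    if reconnect(host):
        return ("restart", host)
    replacement = rematch(host)  # returns a new host name, or None if none is found
    if replacement is not None:
        return ("rematch", replacement)
    return ("abort", None)
```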


In some applications, it is desirable to mark a 
host as failed when it becomes overloaded or experi- 
ences poor network connectivity. The Plush host mon- 
itor that runs on each machine is responsible for peri- 
odically informing the controller about each machine’s 
status. If the controller determines that the perfor- 
mance is less than the application tolerates, it marks 
the host as failed and attempts to rematch. This func- 
tionality is a preference specified at startup. Although 
Plush currently monitors host-level metrics including 
load and free memory, the technique is easily extended 
to encompass sophisticated application-level expecta- 
tions of host viability. 


Controller failures. Because the controller is re- 
sponsible for managing the flow of control across all 
connected clients, recovering from a failure at the con- 
troller is difficult. One solution is to use a simple pri- 
mary-backup scheme, where multiple controllers in- 
crease reliability. All messages sent from the clients 
and primary controller are sent to the backup con- 
trollers as well. If a pre-determined amount of time 
passes and the backup controllers do not receive any 
messages from the primary, the primary is assumed to 
have failed. The first backup becomes the primary, 
and execution continues. 


This strategy has several drawbacks. First, it 
causes extra messages to be sent over the network, 
which limits the scalability of Plush. Second, this ap- 
proach does not perform well when a network parti- 
tion occurs. During a network partition, multiple con- 
trollers may become the primary controller for subsets 
of the clients initially involved in the application. 
Once the network partition is resolved, it is difficult to 
reestablish consistency among all hosts. While we 
have implemented this architecture, we are currently 
exploring other possibilities. 

Scalability 

In addition to fault tolerance, an application con- 
troller designed for large-scale environments must 
scale to hundreds or even thousands of participants. 
Unfortunately there is a tradeoff between performance 
and scalability. The solutions that perform the best at 
moderate scale typically do not scale as well as solu- 
tions with lower performance. To balance scalability 
and performance, Plush provides users with two topo- 
logical alternatives. 
By default, all Plush clients connect directly to
the controller forming a star topology. This architec- 
ture scales to approximately 300 remote hosts, limited 
by the number of file descriptors allowed per process 
on the controller machine in addition to the band- 
width, CPU, and latency required to communicate 
with all connected clients. The star topology is easy to 
maintain, since all clients connect directly to the con- 
troller. In the event of a host failure, only the failed 
host is affected. Further, the time required for the con- 
troller to exchange messages with clients is short due 
to the direct connections. 


At larger scales, network and file descriptor limi- 
tations at the controller become a bottleneck. To ad- 
dress this, Plush also supports tree topologies. In an 
effort to reduce the number of hops between the 
clients and the controller, we construct “bushy” trees, 
where the depth of the tree is small and each node in 
the tree has many children. The controller is the root 
of the tree. The children of the root are chosen to be 
well-connected and historically reliable hosts whenev- 
er possible. Each child of the root acts as a “proxy 
controller” for the hosts connected to it. These proxy 
controllers send invitations and receive joins from oth- 
er hosts, reducing the total number of messages sent 
back to the root controller. Important messages, such 
as failure notifications, are still sent back to the root 
controller. Using the tree topology, we have been able 
to use Plush to manage an application running on 1000 
ModelNet [29] emulated hosts, as well as an applica- 
tion running on 500 PlanetLab clients. We believe that 
Plush has the ability to scale by another order of mag- 
nitude with the current design. 


While the tree topology has many benefits over 
the star topology, it also introduces several new prob- 
lems with respect to host failures and tree mainte- 
nance. In the star topology, a host failure is simple to 
recover from since it only involves one host. In the 
tree topology, however, if a non-leaf host fails, all 
children of the failed host must find a new parent. De- 
pending on the number of hosts affected, a reconfigu- 
ration involving several hosts often has a significant 
impact on performance. Our current implementation 
tries to minimize the probability of this type of failure 
by making intelligent decisions during tree construc- 
tion. For example, in the case of ModelNet, many vir- 
tual hosts (and Plush clients) reside on the same physi- 
cal machine. When constructing the tree in Plush, only 
one client per physical machine connects directly to 
the controller and becomes the proxy controller. The 
remaining clients running on the same physical ma- 
chine become children of the proxy controller. In the 
wide area, similar decisions are made by placing hosts 
that are geographically close together under the same 
parent. This decreases the number of hops and latency 
between leaf nodes and their parent, minimizing the 
chance of network failures. 
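
The ModelNet placement rule (one proxy controller per physical machine, with co-located clients as its children) can be sketched in Python as below. The function name and input format are assumptions, and choosing the first client seen on each machine as the proxy is an arbitrary simplification.

```python
from collections import defaultdict


def build_tree(clients):
    """clients: list of (client_name, physical_machine) pairs.

    Returns {proxy_controller: [children]}; every proxy is a child of the root.
    """
    by_machine = defaultdict(list)
    for name, machine in clients:
        by_machine[machine].append(name)
    tree = {}
    for members in by_machine.values():
        proxy, *children = members  # first client on each machine becomes the proxy
        tree[proxy] = children
    return tree


# Hypothetical virtual hosts v1..v4 spread over physical machines m1 and m2.
tree = build_tree([("v1", "m1"), ("v2", "m1"), ("v3", "m2"), ("v4", "m1")])
```

Grouping by physical machine keeps the tree two levels deep below the root, and a proxy can fail only together with its children, since they share hardware.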

Running An Application Using Plush


In this section, we will discuss how the architec- 
tural components of Plush interact to run a distributed 
application. When starting Plush, the user’s worksta- 
tion becomes the controller. The user submits an appli- 
cation specification to the Plush controller. The con- 
troller parses the specification, and internally creates 
the objects shown above the dotted line in Figure 1a.


After parsing the application specification, the 
controller runs the resource discovery and acquisition 
unit to find a suitable set of resources that meet the re- 
quirements specified in the component blocks. Upon 
locating the necessary resources, the resource manager 
stores the required access and authentication information.
The controller then attempts to connect to each
remote host. If the Plush client is not already running,
the controller initiates a bootstrapping procedure to 
copy the Plush client binary to the remote host, and 
then uses SSH to connect to the remote host and start 
the client process. Once the client process is running, 
the controller establishes a TCP connection to the re- 
mote host, and transmits an INVITE message to the host 
to join the Plush overlay (which is either a star or tree 
as discussed previously). 


If a Plush client agrees to run the application, the 
client sends a JOIN message back to the controller ac- 
cepting the invitation. Next, the controller sends a 
PREPARE message to the new client, which contains a 
copy of the application specification (XML representation).
The client parses the application specification,
starts a local host monitor, sends a PREPARED message
back to the controller, and waits for further instruction.
Once enough hosts join the overlay and
agree to run the application, the controller initiates the
beginning of the application deployment stage by
sending a GO message to all connected clients. The
file managers then begin installing the requested software
and preparing the hosts for execution.

Figure 2a: Nebula World View tab showing an application running on PlanetLab. Different colored dots indicate PlanetLab sites in various stages of execution.

Figure 2b: Nebula Application View tab displaying Plush application specification.


In most applications, the controller instructs the 
hosts to begin execution after all hosts have completed 
the software installation. (Synchronizing the begin- 
ning of the execution is not required if the application 
does not need all hosts to start simultaneously.) Since 
each client has now created an exact copy of the con- 
troller’s application specification, the controller and 
clients exchange messages about the application’s 
progress using the block naming abstraction (i.e., 
/app/comp1/proc1) to identify the status of the execu- 
tion. For barriers, a barrier manager running on the 
controller determines when it is appropriate for hosts 
to be released from the barriers. 
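
The join handshake described in this section (INVITE, JOIN, PREPARE, PREPARED, GO) can be modeled as a small per-client state machine. The Python sketch below is illustrative: the message names come from the text, while the `ClientSession` class and its strict ordering check are assumptions.

```python
# (sender, message) pairs in the order the text describes them.
HANDSHAKE = [
    ("controller", "INVITE"),
    ("client", "JOIN"),
    ("controller", "PREPARE"),   # carries the XML application specification
    ("client", "PREPARED"),
    ("controller", "GO"),
]


class ClientSession:
    """Tracks how far one client has progressed through the handshake."""

    def __init__(self):
        self.step = 0

    def deliver(self, sender, message):
        expected = HANDSHAKE[self.step] if self.step < len(HANDSHAKE) else None
        if (sender, message) != expected:
            raise ValueError(f"unexpected {message} from {sender}")
        self.step += 1

    @property
    def deploying(self):
        """True once GO has been received and installation may begin."""
        return self.step == len(HANDSHAKE)
```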


Upon detecting a failure, clients notify the con- 
troller, and the controller attempts to recover from it 
according to the actions enumerated in the user’s ap- 
plication specification. Since many failures are appli- 
cation-specific, Plush exports optional callbacks to the 
application itself to determine the appropriate reaction 
for some failure conditions. When the application 
completes (or upon a user command), Plush stops all 
associated processes, transfers output data back to the 
controller’s local disk if desired, performs user-speci- 
fied cleanup actions on the resources, disconnects the 
resources from the overlay by closing the TCP con- 
nections, and stops the Plush client processes. 


User Interface 


Plush aims to support a variety of applications 
being run by users with a wide range of expertise in 
building and managing distributed applications. Thus, 
Plush provides three interfaces, each of which gives
users a different way to interact with their applications.
We describe the functionality of each user
interface in this section.


Figure 1a shows the user interface above all other
parts of Plush. In reality, the user interacts with every
box shown in the figure through the user interface.
For example, the user forces the resource discovery 
and acquisition unit to find a new set of resources us- 
ing a terminal command. We designed Plush in this 
way to provide maximum control over the application. 
At any point, the user can override a default Plush be- 
havior. The effect is a customizable application con- 
troller that supports a variety of distributed applica- 
tions. 

Graphical User Interface 
In an effort to simplify the creation of application
specifications and help visualize the status of executions
running on resources around the world, we
implemented a graphical user interface for Plush called
Nebula. In particular, we designed Nebula (as shown 
in Figures 2a, 2b, and 3) to simplify the process of 
specifying and managing applications running across 
the PlanetLab wide-area testbed. Plush obtains data 
from the PlanetLab Central (PLC) database to deter- 
mine what hosts a user has access to, and Nebula uses 
this information to plot the sites on the map. To start 
using Nebula, users have the option of building their 
Plush application specification from scratch or loading 
a preexisting XML document representing an applica- 
tion specification. Upon loading the application speci- 
fication, the user runs the application by clicking the 
Run button from the Plush toolbar, which causes Plush 
to start locating and acquiring resources. 


The main Nebula window contains four tabs that 
show different information about the user’s application. In
the “World View” tab, users see an image of a world 
map with colored dots indicating PlanetLab hosts. Dif- 
ferent colored dots on the map indicate sites involved 
in the current application. In Figure 2a, the dots 
(which range from red to green on a user’s screen) 
show PlanetLab sites involved in the current applica- 
tion. The grey dots are other available PlanetLab sites 
that are not currently being used by Plush. As the ap- 
plication proceeds through the different phases of exe- 
cution, the sites change color, allowing the user to vi- 
sualize the progress of their application. When failures 
occur, the impacted sites turn red, giving the user an 
immediate visual indication of the problem. Similarly, 
green dots indicate that the application is executing 
correctly. If a user wishes to establish an SSH connec- 
tion directly to a particular resource, they can simply 
right-click on a host in the map and choose the SSH 
option from the pop-up menu. This opens a new tab in 
Nebula containing an SSH terminal to the host. Users 
can also mark hosts as failed by right-clicking and 
choosing the Fail option from the pop-up menu if they 
are able to determine failure more quickly than Plush’s 
automated techniques. Failed hosts are completely re- 
moved from the execution. 


Users retrieve more detailed usage statistics and 
monitoring information about specific hosts (such as 
CPU load, free memory, or bandwidth usage) by
double-clicking on the individual sites in the map. This
opens a second window that displays real-time graphs 
based on data retrieved from resource monitoring 
tools, as shown in the bottom right corner of Figure 
2a. The second smaller window displays a graph of 
the CPU or memory usage, and the status of the appli- 
cation on each host. Plush currently provides built-in 
support for monitoring CoMon [25] data on PlanetLab 
machines, which is the source of the CPU and memo- 
ry data. Additionally, if the user wishes to view the 
CPU usage or percentage of free memory available 
across all hosts, there is a menu item under the Planet- 
Lab menu that changes the colors of the dots on the 
map such that red means high CPU usage or low free 
memory, and green indicates low CPU usage and high
free memory. Users can also add hosts to and remove hosts
from their PlanetLab account (or slice in PlanetLab terminology)
directly by highlighting regions of the map
and choosing the appropriate menu option from the 
PlanetLab menu. Additionally, users can renew their 
PlanetLab slice from Nebula. 


The second tab in the Nebula main window is the 
“Application View.” The Application View tab, shown 
in Figure 2b, allows users to build Plush application 
specifications using the blocks described previously. Al- 
ternatively, users may load an existing XML file describ- 
ing an application specification by choosing the Load Ap- 
plication menu option under the File menu. There is also 
an option to save a new application specification to an 
XML file for later use. After creating or loading an ap- 
plication specification, the Run button located on the Ap- 
plication View tab starts the application. 


The Plush blocks in the application specification 
change to green during the execution of the application
to indicate progress. After an application begins
execution, users have the option to “force” an application
to skip ahead to the next phase of execution
(which corresponds to releasing a synchronization
barrier), or to abort an application to terminate execution
across all resources. Once the application aborts
or completes execution, the user may either save their 
application specification, disconnect from the Plush 
communication mesh, restart the same application, or 
load and run a new application by choosing the appro- 
priate option from the File menu. 


The third tab is the “Resource View” tab. This 
tab is blank until an application starts running. During 
execution, this tab lists the specific machines that are 
currently involved in the application. If failures occur 
during execution, the list of machines is updated dynamically,
such that the Resource View tab always
contains an accurate listing of the machines that are in 
use. The resources are separated into components, so 
that the user knows which resources are assigned to 
which tasks in their application. 


The fourth tab in Nebula is called the “Host
View” tab, shown in Figure 3. This tab contains a table
that displays the hostname of all available resources.
In the right column, the status of the host is
shown. The purpose of this tab is to give users another
way to visualize the status of an executing application.
The status of the host in the right column
corresponds to the color of the dot in the “World
View” tab. This tab also allows users to run shell
commands simultaneously on several resources, and
view the output. As shown in Figure 3, users can select
multiple hosts at once and run a command; the
output is shown in the text-box at the bottom of the
window. Note that hosts do not have to be involved in
an application in order to take advantage of this feature.
Plush will connect to any available resources and
run commands on behalf of the user. Just as in the
World View tab, right-clicking on hosts in the Host
View tab opens a pop-up menu that enables users to
SSH directly to the hosts.

Command-line Interface 

Motivated by the popularity and familiarity of 
the shell interface in UNIX, Plush further streamlines 
the develop-deploy-debug cycle for distributed appli- 
cation management through a simple command-line 
interface where users deploy, run, monitor, and debug 
their distributed applications running on hundreds of 
remote machines. Plush combines the functionality of
a distributed shell with the power of an application
controller to provide a robust execution environment
for users to run their applications. From a user’s standpoint,
the Plush terminal looks like a shell. Plush supports
several commands for monitoring the state of an
execution, as well as commands for manipulating the
application specification during execution. Table 1
shows some of the available commands.

Figure 3: Nebula Host View tab showing PlanetLab resources. This tab allows users to select multiple hosts at once and run shell commands on the selected resources. The text-box at the bottom shows the output from the command.


Command          Description

load <file>      Load application specification
connect <host>   Connect to host and start client
disconnect       Close all connections and clients
info nodes       Print all resource information
info mesh        Print communication fabric status info
info control     Print application control state info
run              Start the application (after load)
shell <cmd>      Run “cmd” on all connected resources


Table 1: Sample Plush terminal commands.


Programmatic Interface 

Many of the commands available via the Plush command-line
interface are also exported via an XML-RPC interface, which
gives those who want programmatic access functionality
similar to that of the command line. This allows Plush to be
scripted and used for remote execution and automated
application management, and also enables the use of external
services for resource discovery, creation, and acquisition
within Plush. In addition to the commands that Plush
exports, external services and programs may also register
themselves with Plush so that the controller can send
callbacks to the XML-RPC client when various actions occur
during the application’s execution.


Figure 4 shows the Plush XML-RPC API. The functions shown in
the PlushXmlRpcServer class are available to users who wish
to access Plush programmatically in scripts, or for external
resource discovery and acquisition services that need to add
and remove resources from the Plush resource pool. The
plushAddNode(HashMap) and plushRemoveNode(string) calls add
and remove nodes from the resource pool, respectively.
setXmlRpcClientUri(string) registers XML-RPC clients for
callbacks, while plushTestConnection() simply tests the
connection to the Plush server and returns “Hello World.”
The remaining function calls in the class mimic the behavior
of the corresponding command-line operations.
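
As a rough illustration of how a script might call these functions, the following Python sketch uses the standard xmlrpc modules. It is a sketch under stated assumptions rather than the Plush implementation: the stub server stands in for a real Plush controller, the "hostname" property key and the example hostnames are hypothetical, and the stub returns True from calls that Figure 4 declares as void because XML-RPC needs a return value.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


class StubPlushServer:
    """Stand-in for a Plush controller (hypothetical, for illustration only)."""

    def __init__(self):
        self.pool = {}  # hostname -> properties

    def plushTestConnection(self):
        return "Hello World"

    def plushAddNode(self, properties):
        # The "hostname" key is an assumption; the paper does not list the keys.
        self.pool[properties["hostname"]] = properties
        return True

    def plushRemoveNode(self, hostname):
        self.pool.pop(hostname, None)
        return True


def start_stub(stub):
    """Serve the stub on an ephemeral localhost port; return (server, url)."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_instance(stub)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}/"


def demo():
    stub = StubPlushServer()
    server, url = start_stub(stub)
    try:
        plush = xmlrpc.client.ServerProxy(url)
        greeting = plush.plushTestConnection()
        plush.plushAddNode({"hostname": "node1.example.org"})
        plush.plushAddNode({"hostname": "node2.example.org"})
        plush.plushRemoveNode("node1.example.org")
        return greeting, sorted(stub.pool)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Against a real controller, only the ServerProxy calls would remain, and the URL would be whatever address the Plush XML-RPC server listens on.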


Aside from resource discovery and acquisition services, the
XML-RPC API allows for the implementation of different user
interfaces for Plush. Since almost all of the Plush terminal
commands are available as XML-RPC function calls, users are
free to implement their own customized, environment-specific
user interface without understanding or modifying the
internals of the Plush implementation. This gives users the
flexibility to develop in the programming language of their
choice: most mainstream programming languages support
XML-RPC, so an interface for Plush can be written in any of
them. For example, Nebula is implemented in Java, and uses
the XML-RPC interface shown in Figure 4 to interact with a
Plush controller. To increase the functionality and simplify
the development of these interfaces, the Plush XML-RPC
server has the ability to make callbacks to programs that
register with the Plush controller via
setXmlRpcClientUri(string). Some of the more common callback
functions are shown at the bottom of Figure 4 in class
PlushXmlRpcCallback. Note that these callbacks are only
useful if the programmatic client implements the
corresponding functions.


class PlushXmlRpcServer extends XmlRpcServer {
  void plushAddNode(HashMap properties);
  void plushRemoveNode(string hostname);
  string plushTestConnection();
  void plushCreateResources();
  void plushLoadApp(string filename);
  void plushRunApp();
  void plushDisconnectApp(string hostname);
  void plushQuit();
  void plushFailHost(string hostname);
  void setXmlRpcClientUri(string clientUrl);
}


class PlushXmlRpcCallback extends XmlRpcClient {
  void sendPlanetLabSlices();
  void sendSliceNodes(string slice);
  void sendAllPlanetLabNodes();
  void sendApplicationExit();
  void sendHostStatus(string host);
  void sendBlockStatus(string block);
  void sendResourceMatching(HashMap matching);
}


Figure 4: Plush XML-RPC API. 
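
The callback side of Figure 4 can be sketched in the same way. The Python example below is a hypothetical client that implements only two of the callbacks; the test code plays the role of the controller. The payload passed to sendHostStatus and the True return values are assumptions, since the paper gives only the function signatures. The call to an unimplemented callback fails with an XML-RPC fault, which is one way to see why callbacks are only useful when the client implements them.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


class CallbackReceiver:
    """Client-side handlers named after two callbacks in Figure 4."""

    def __init__(self):
        self.events = []

    def sendHostStatus(self, host):
        # What the string carries beyond a host identifier is an assumption.
        self.events.append(("host", host))
        return True

    def sendApplicationExit(self):
        self.events.append(("exit",))
        return True


def serve_callbacks(receiver):
    """Expose the receiver on an ephemeral localhost port; return (server, url)."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_instance(receiver)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}/"


def demo():
    receiver = CallbackReceiver()
    server, url = serve_callbacks(receiver)
    try:
        # With a real controller, this URL would be registered through
        # setXmlRpcClientUri(url). Here we act as the controller ourselves.
        controller_side = xmlrpc.client.ServerProxy(url)
        controller_side.sendHostStatus("node1.example.org")
        controller_side.sendApplicationExit()
        try:
            controller_side.sendBlockStatus("participants")
            unimplemented_failed = False
        except xmlrpc.client.Fault:
            unimplemented_failed = True  # receiver does not implement it
        return receiver.events, unimplemented_failed
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```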


Implementation Details 


Plush is a publicly available software package 
consisting of over 60,000 lines of C++ code. Plush de- 
pends on several C++ libraries, including those pro- 
vided by xmlrpc-c, curl, xml2, zlib, math, openssl, 
readline, curses, boost, and pthreads. The command- 
line interface also depends on packages for lex and 
yacc (we use flex and bison). 


In addition to the main C++ codebase, Plush uses 
several simple perl scripts for interacting with the Plan- 
etLab Central database and bootstrapping resources. 
Plush runs on most UNIX-based platforms, including 
Linux, FreeBSD, and Mac OS X, and a single Plush 
controller can manage clients running on different oper- 
ating systems. The only prerequisite for using Plush on 
a resource is the ability to SSH to the resource. Currently,
Plush is being used to manage applications on PlanetLab,
ModelNet, and Xen virtual machines [5] in our research
cluster.


Nebula consists of approximately 25,000 lines of 
Java code. Nebula communicates with Plush using the 
XML-RPC interface. XML-RPC is implemented in 
Nebula using the Apache XML-RPC client and server 
packages. In addition, Nebula uses the JOGL imple- 
mentation of the OpenGL graphics package for Java. 
Nebula runs in any computing environment that sup- 
ports Java, including Windows, Linux, FreeBSD, and 
Mac OS X among others. Note that since Nebula and 
Plush communicate solely via XML-RPC, it is not 
necessary to run Nebula on the same physical machine 
as the Plush controller. 


Usage Scenarios 


One of the primary goals of our work is to build a generic
application management framework that supports execution in
any execution environment. This is mainly accomplished
through the Plush resource abstraction. In Plush, resources
are computing devices capable of hosting applications, such
as physical machines, emulated hosts, or virtual machines.
To show that Plush achieves this goal, in this section we
take a closer look at specific uses of Plush in different
distributed computing environments, including a live
deployment testbed, an emulated network, and a cluster of
virtual machines.


PlanetLab Live Deployment


To demonstrate Plush’s ability to manage the live deployment
of applications, we revisit our previous example from the
second section and show how Plush manages SWORD [23] on
PlanetLab. Recall that SWORD is a resource discovery service
that relies on host monitors running on each PlanetLab
machine to report information periodically about their
resource usage. This data is stored in a DHT (distributed
hash table), and later accessed by SWORD clients to respond
to requests for groups of resources that have specific
characteristics. SWORD is a service that helps PlanetLab
users find the best set of resources based on the priorities
and requirements specified, and is an example of a
long-running Internet service.

The XML application specification for SWORD is shown in
Figure 5. Note that this specification could be built using
Nebula, in which case the user would never have to edit the
XML directly. The top half of the specification in Figure 5
defines the SWORD software package and the component
(resource group) required for the application. Notice that
SWORD uses one component consisting of hosts assigned to the
ucsd_sword PlanetLab slice.


<?xml version="1.0" encoding="utf-8"?>
<plush>
  <project name="sword">
    <software name="sword_software" type="tar">
      <package name="sword.tar" type="web">
        <path>http://plush.ucsd.edu/sword.tar</path>
        <dest>sword.tar</dest>
      </package>
    </software>
    <component name="sword_participants">
      <rspec>
        <num_hosts min="10" max="800"/>
      </rspec>
      <resources>
        <resource type="planetlab" group="ucsd_sword"/>
      </resources>
      <software name="sword_software"/>
    </component>
    <application_block name="sword_app_block" service="1"
                       reconnect_interval="300">
      <execution>
        <component_block name="participants">
          <component name="sword_participants"/>
          <process_block name="sword">
            <process name="sword_run">
              <path>dd/planetlab/run sword</path>
            </process>
          </process_block>
        </component_block>
      </execution>
    </application_block>
  </project>
</plush>


Figure 5: SWORD application specification.
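
To make the structure of the specification concrete, the following Python sketch reads a copy of the Figure 5 XML (with attribute quoting normalized) using the standard library parser and pulls out the fields discussed in the text. The summarize function and the keys of the dictionary it returns are illustrative names, not part of Plush.

```python
import xml.etree.ElementTree as ET

# Copy of the Figure 5 specification with quoting normalized.
SWORD_SPEC = """<?xml version="1.0" encoding="utf-8"?>
<plush>
  <project name="sword">
    <software name="sword_software" type="tar">
      <package name="sword.tar" type="web">
        <path>http://plush.ucsd.edu/sword.tar</path>
        <dest>sword.tar</dest>
      </package>
    </software>
    <component name="sword_participants">
      <rspec>
        <num_hosts min="10" max="800"/>
      </rspec>
      <resources>
        <resource type="planetlab" group="ucsd_sword"/>
      </resources>
      <software name="sword_software"/>
    </component>
    <application_block name="sword_app_block" service="1"
                       reconnect_interval="300">
      <execution>
        <component_block name="participants">
          <component name="sword_participants"/>
          <process_block name="sword">
            <process name="sword_run">
              <path>dd/planetlab/run sword</path>
            </process>
          </process_block>
        </component_block>
      </execution>
    </application_block>
  </project>
</plush>
"""


def summarize(spec_xml):
    """Extract the fields of a Plush-style spec that the text discusses."""
    root = ET.fromstring(spec_xml.encode("utf-8"))
    project = root.find("project")
    component = project.find("component")
    num_hosts = component.find("rspec/num_hosts")
    resource = component.find("resources/resource")
    block = project.find("application_block")
    return {
        "project": project.get("name"),
        "component": component.get("name"),
        "min_hosts": int(num_hosts.get("min")),
        "max_hosts": int(num_hosts.get("max")),
        "slice": resource.get("group"),
        "is_service": block.get("service") == "1",
        "reconnect_interval": int(block.get("reconnect_interval")),
        "processes": [p.get("name") for p in block.iter("process")],
    }


if __name__ == "__main__":
    print(summarize(SWORD_SPEC))
```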


An interesting feature of this component definition is the
“num_hosts” tag. Since SWORD is a service that wants to run
on as many nodes as possible, a range of acceptable values
is used rather than a single number. In this case, as long
as at least 10 hosts are available, Plush will continue
managing SWORD. Since the max is set to 800, Plush will not
look for more than 800 resources to host SWORD. Since
PlanetLab contains fewer than 800 hosts, this means that
SWORD will attempt to run on all PlanetLab resources.
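
The effect of the min/max range can be written down in a few lines. The Python sketch below is an interpretation of the behavior described above, not Plush code: plan_hosts and its return keys are hypothetical names, and the host counts in the usage lines are made-up inputs.

```python
def plan_hosts(usable_hosts, min_hosts, max_hosts):
    """Interpret a num_hosts range against the number of usable hosts.

    target: how many hosts the controller would try to use (never above max).
    viable: whether the application can keep running (at least min hosts).
    """
    if min_hosts > max_hosts:
        raise ValueError("min_hosts must not exceed max_hosts")
    target = min(usable_hosts, max_hosts)
    return {"target": target, "viable": target >= min_hosts}


if __name__ == "__main__":
    # Hypothetical counts: a testbed smaller than max uses every usable host.
    print(plan_hosts(usable_hosts=700, min_hosts=10, max_hosts=800))
    print(plan_hosts(usable_hosts=5, min_hosts=10, max_hosts=800))
```

With a maximum larger than the testbed, the target is simply every usable host, which matches the observation that SWORD attempts to run on all PlanetLab resources.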


The lower half of the application specification defines the
application block, component block, and process block that
describe the SWORD execution. The application block contains
a few key features that help Plush react to failures more
efficiently for long-running services. When defining the
application block object for SWORD, we include special
“service” and “reconnect_interval” attributes. The service
attribute tells the Plush controller that SWORD is a
long-running service and requires different default
behaviors for initialization and failure recovery. For
example, during application initialization the controller
does not wait for all participants to install the software
before starting all hosts simultaneously. Instead, the
controller instructs individual clients to start the
application as soon as they finish installing the software,
since there is no reason to synchronize the execution across
all hosts. Further, if a process fails when the service
attribute has been specified, the controller attempts to
restart SWORD on that host without aborting the entire
application.
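
The two behaviors controlled by the service attribute can be summarized as a pair of small decision functions. This Python sketch is a simplified model under stated assumptions: the names AppBlock, hosts_to_start, and on_process_failure are hypothetical, and treating a failure in a non-service block as an abort is an inference from the text, which only spells out the service case.

```python
from dataclasses import dataclass


@dataclass
class AppBlock:
    service: bool  # mirrors service="1" in the application block


def hosts_to_start(block, installed, participants):
    """Return the hosts allowed to start the application right now."""
    if block.service:
        # Services start on each host as soon as its install finishes.
        return sorted(installed)
    # Otherwise wait until every participant has installed, then start all.
    return sorted(participants) if installed >= participants else []


def on_process_failure(block, host):
    """Decide how to react when a process fails on one host."""
    if block.service:
        return ("restart", host)  # restart locally, keep the rest running
    return ("abort", None)  # assumed default for non-service blocks


if __name__ == "__main__":
    service = AppBlock(service=True)
    print(hosts_to_start(service, {"a"}, {"a", "b"}))
    print(on_process_failure(service, "a"))
```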


The reconnect_interval specifies the period of 
time the controller waits before rerunning the resource 
discovery and acquisition unit. For long running ser- 
vices, hosts often fail and recover during execution. 
The reconnect_interval attribute tells the controller to 
check for new hosts that have come alive since the last 
run of the resource discovery unit. The controller also 
unsets any hosts that had previously been marked as 
“failed” at this time. This is the controller’s way of 
“refreshing” the list of available hosts. The controller 
continues to search for new hosts until reaching the 
maximum num_hosts value, which is 800 in our case.

Evaluating Fault Tolerance

To demonstrate Plush’s ability to automatically recover from
host failures for long-running services, we ran SWORD on
PlanetLab with 100 randomly chosen hosts, as shown in
Figure 6. The host set includes machines behind DSL links as
well as hosts from other continents. When Plush starts the
application, the controller starts the Plush client on 100
randomly chosen PlanetLab machines, and they each begin
downloading the SWORD software package (38 MB).


It takes approximately 1000 seconds for all hosts to
successfully download, install, and start SWORD. At time
t = 1250s, we kill the SWORD process on 20 randomly chosen
hosts to simulate host failure. Normally, Plush would
automatically try to restart the SWORD process on these
hosts. However, we disabled this feature to simulate host
failures and force a rematching. The remote Plush clients
notify the controller that the hosts have failed, and the
controller begins to find replacements for the failed
machines. The replacement hosts join the Plush overlay and
start downloading the SWORD software. As before, Plush
chooses the replacements randomly, and low-bandwidth/
high-latency links have a great impact on the time it takes
to fully recover from the host failure. At t = 2200s, the
service is restored on 100 machines.


Figure 6: SWORD running on 100 randomly chosen PlanetLab
hosts. At t = 1250 seconds, we fail 20 hosts. The Plush
controller finds new hosts, which start the Plush client
process and begin downloading and installing the SWORD
software. Service is fully restored at approximately
t = 2200 seconds.


Using Plush to manage long-running services like 
SWORD alleviates the burden of manually probing for 
failures and configuring/reconfiguring hosts. Further, 
Plush interfaces directly with the PlanetLab Central 
API, which means that users can automatically add 
hosts to their slice and renew their slice using Plush. 
This is beneficial since services typically want to run on 
as many PlanetLab hosts as possible, including any new 
hosts that come online. In addition, Plush simplifies the 
task of debugging problems by providing a single point 
of control for all connected PlanetLab hosts. Thus, if a 
user wants to view the memory consumption of their 
service across all connected hosts, a single Plush com- 
mand retrieves this information, making it easier to 
maintain and monitor a service running on hundreds of 
resources scattered around the world. 


ModelNet Emulation 


Aside from PlanetLab resources, Plush also supports running
applications on virtual hosts in emulated environments. In
this section we discuss how Plush supports using ModelNet
[29] emulated resources to host applications. In addition,
we will discuss how a batch scheduler uses the Plush
programmatic interface to perform remote job execution.


Mission is a simple batch scheduler used to man- 
age the execution of jobs that run on ModelNet in our 
research cluster. ModelNet is a network emulation envi- 
ronment that consists of one or more Linux edge nodes 
and a set of FreeBSD core machines running a special- 
ized ModelNet kernel. The code running on the edge 
hosts routes packets through the core machines, where 
the packets are subjected to the delay, bandwidth, and 
loss specified in a target topology. A single physical ma- 
chine hosts multiple “virtual” IP addresses that act as 
emulated resources on the Linux edge hosts. 


To set up the ModelNet computing environment
with the target topology, two phases of execution are re- 
quired: deploy and run. Before running any applications, 
the user must first deploy the desired topology on each 
physical machine, including the FreeBSD core. The de- 
ploy process essentially instantiates the emulated hosts, 
and installs the desired topology on all machines. Then, 
after setting a few environment variables, the user is free 
to run applications on the emulated hosts using virtual 
IP addresses just as applications are run on physical ma- 
chines using real IP addresses. 


A single ModelNet experiment typically con- 
sumes almost all of the computing resources available 
on the physical machines involved. Thus, when run- 
ning an experiment, it is essential to restrict access to 
the machines so that only one experiment is running at 
a time. Further, there is a limited number of FreeBSD
core machines running the ModelNet kernel available, 
and access to these hosts must also be arbitrated. Mis- 
sion is a batch scheduler developed locally to help ac- 
complish this goal by allowing the resources to be ef- 
ficiently shared among multiple users. ModelNet users 
submit their jobs to the Mission job queue, and as the 
machines become available, Mission pulls jobs off the 
queue and runs them on behalf of the user, ensuring 
that no two jobs are run simultaneously. 
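
The scheduling rule described here, one job at a time with the rest waiting in a queue, can be sketched as follows. This is a minimal Python model and not Mission's code: the class and method names are hypothetical, and first-in-first-out ordering is an assumption, since the text says only that jobs are pulled off the queue as machines become available.

```python
from collections import deque


class MissionQueue:
    """Toy model of a batch queue that runs at most one job at a time."""

    def __init__(self):
        self.pending = deque()
        self.running = None
        self.finished = []

    def submit(self, job):
        self.pending.append(job)
        self._dispatch()

    def job_done(self):
        if self.running is None:
            raise RuntimeError("no job is running")
        self.finished.append(self.running)
        self.running = None
        self._dispatch()

    def _dispatch(self):
        # Start the next job only when the machines are free.
        if self.running is None and self.pending:
            self.running = self.pending.popleft()


if __name__ == "__main__":
    queue = MissionQueue()
    for job in ("job-a", "job-b", "job-c"):
        queue.submit(job)
    print(queue.running, list(queue.pending))
    queue.job_done()
    print(queue.running, list(queue.pending))
```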


A Mission job submission has two components: a Plush
application specification and a resource directory file. For
ModelNet, the directory file contains information about both
the physical and virtual (emulated)
resources on which the ModelNet experiment will run. 
In the resource directory file, some entries include two 
extra parameters, “vip” and “vn”, which define the 
virtual IP address and virtual number (similar to a 
hostname) for the emulated resources. In addition to 
the directory file that is used to populate the Plush re- 
source pool, users also submit an application specifi- 
cation describing the application they wish to run on 
the emulated topology to the Mission server. 


The application specification submitted to Mission contains
two component blocks separated by a synchronization barrier.
The first component block describes the processes that run
on the physical machines during the deployment phase (where
the emulated topology is instantiated). The second component
block defines the processes associated with the target
application. When the controller starts the Plush clients on
the emulated hosts, it specifies extra command line
arguments that are defined in the directory file by the
“vip” and “vn” attributes. This sets the appropriate
ModelNet environment variables, ensuring that all commands
run on that client on behalf of the user inherit those
settings.


When a user submits a Plush application specifi- 
cation and directory file to Mission, the Mission server 
parses the directory file to identify which resources 
are needed to host the application. When those re- 
sources become available for use, Mission starts a 
Plush controller on behalf of the user using the Plush 
XML-RPC interface. Mission passes Plush the direc- 
tory file and application specification, and continues 
to interact throughout the execution of the application 
via XML-RPC. After Plush notifies Mission that the 
execution has ended, Mission kills the Plush process 
and reports back to the user with the results. Any ter- 
minal output that is generated is emailed to the user. 


Plush jobs are currently being submitted to Mis- 
sion on a daily basis at UCSD. These jobs include ex- 
perimental content distribution protocols, distributed 
model checking systems, and other distributed appli- 
cations of varying complexity. Mission users benefit 
from Plush’s automated execution capabilities. Users 
simply submit their jobs to Mission and receive an 
email when their task is complete. They do not have to 
spend time configuring their environment or starting 
the execution. Individual host errors that occur during 
execution are aggregated into one message and re- 
turned back to the user in the email. Logfiles are col- 
lected in a public directory on a common file system 
and labeled with a job ID, so that users are free to in- 
spect the output from individual hosts if desired. 
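
One way to picture the aggregation step is a function that folds per-host errors into a single report. The Python sketch below is illustrative only: the (host, message) input format, the grouping by message, and the report layout are all assumptions, since the paper does not describe the format of the email.

```python
from collections import defaultdict


def aggregate_errors(job_id, host_errors):
    """Fold (host, message) pairs into one report, grouped by message."""
    by_message = defaultdict(list)
    for host, message in host_errors:
        by_message[message].append(host)
    lines = [f"Job {job_id}: {len(host_errors)} host error(s)"]
    for message in sorted(by_message):
        hosts = ", ".join(sorted(by_message[message]))
        lines.append(f"- {message}: {hosts}")
    return "\n".join(lines)


if __name__ == "__main__":
    errors = [("h2", "timeout"), ("h1", "timeout"), ("h3", "disk full")]
    print(aggregate_errors("42", errors))
```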


Virtual Machine Deployment 


In all of the examples discussed above, the pool 
of resources available to Plush is known at startup. In 
the PlanetLab examples, Plush uses slice information 
to determine the set of user-accessible hosts. For Mod- 
elNet, the emulated topology includes specific infor- 
mation about the virtual hosts to be created and this 
information is passed to Plush in the directory file. We 
next describe how Plush manages applications in envi- 
ronments without fixed sets of machines, but rather 
underlying capabilities to create and destroy resources 
on demand. 


Shirako [16] is a utility computing framework. Through
programmatic interfaces, Shirako allows users to create
dynamic on-demand clusters of resources, including storage,
network paths, physical servers, and virtual machines.
Shirako is based on a resource leasing abstraction, enabling
users to negotiate access to resources. Usher [21] is a
virtual machine scheduling system for cluster environments.
It allows users to create their own virtual machines or
clusters. When a user requests a virtual machine, Usher uses
data collected by virtual machine monitors to make informed
decisions about when and where the virtual machine should
run.


We have extended Plush to interface with both 
Shirako and Usher. Through its XML-RPC interface, 
Plush interacts with the Shirako and Usher servers. As 
resources are created and destroyed, the resource pool 
in Plush is updated to include the current set of leased 
resources. Using this dynamic resource pool, Plush 
manages applications running on potentially tempo- 
rary virtual machines in the same way that applica- 
tions are managed in static environments like Planet- 
Lab. Thus, using the resource abstractions provided by 
Plush, users are able to run their applications on Plan- 
etLab, ModelNet, or on clusters of virtual machines 
without ever having to worry about the underlying de- 
tails of the environment. 
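
The idea of a resource pool that follows lease events can be sketched in a few lines of Python. This is a toy model under stated assumptions: ResourcePool and apply_lease_events are hypothetical names, the "created"/"destroyed" event format is invented for the example, and the "hostname" key is an assumption. The add and remove methods are meant to mirror the plushAddNode and plushRemoveNode calls of the XML-RPC API.

```python
class ResourcePool:
    """Toy resource pool keyed by hostname."""

    def __init__(self):
        self._hosts = {}

    def add_node(self, properties):
        # Mirrors plushAddNode(HashMap); "hostname" key is an assumption.
        self._hosts[properties["hostname"]] = dict(properties)

    def remove_node(self, hostname):
        # Mirrors plushRemoveNode(string); unknown hosts are ignored.
        self._hosts.pop(hostname, None)

    def hosts(self):
        return sorted(self._hosts)


def apply_lease_events(pool, events):
    """Apply hypothetical (kind, payload) lease events to the pool."""
    for kind, payload in events:
        if kind == "created":
            pool.add_node(payload)
        elif kind == "destroyed":
            pool.remove_node(payload)
        else:
            raise ValueError(f"unknown event kind: {kind}")


if __name__ == "__main__":
    pool = ResourcePool()
    apply_lease_events(pool, [
        ("created", {"hostname": "vm1", "memory": 200}),
        ("created", {"hostname": "vm2", "memory": 200}),
        ("destroyed", "vm1"),
    ])
    print(pool.hosts())
```

Because applications see only the pool, the same application logic applies whether the hosts are fixed PlanetLab nodes or short-lived leased virtual machines.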


To support dynamic resource creation and management, we
augment the Plush application specification with a
description of the desired virtual machines as shown in
Figure 7. Specifically, the Plush application specification
needs to include information about the desired attributes of
the resources so that this information can be passed on to
either Shirako or Usher. Shirako and Usher currently create
Xen [5] virtual machines (as indicated by the “type” flag in
the resource description) with the CPU speed, memory, disk
space, and maximum bandwidth specified in the resource
request. As the Plush controller parses the application
specification, it stores the resource description. Then when
the create resource command is issued either via the
terminal interface or programmatically through XML-RPC,
Plush contacts the appropriate Shirako or Usher server and
submits the resource request. Once the resources are ready
for use, Plush is informed via an XML-RPC callback that also
contains contact information about the new resources. This
callback updates the Plush resource pool and the user is
free to start applications on the new resources by issuing
the run command to the Plush controller.


<?xml version="1.0" encoding="utf-8"?>
<plush>
  <project name="simple">
    <component name="Group1">
      <rspec>
        <num_hosts>10</num_hosts>
        <shirako>
          <num_hosts>10</num_hosts>
          <type>1</type>
          <memory>200</memory>
          <bandwidth>200</bandwidth>
          <cpu>50</cpu>
          <lease_length>600</lease_length>
          <server>http://shirako.cs.duke.edu:20000</server>
        </shirako>
      </rspec>
      <resources>
        <resource type="ssh" group="shirako"/>
      </resources>
    </component>
  </project>
</plush>


Figure 7: Plush component definition containing Shirako
resources. The resource description contains a lease
parameter which tells Shirako how long the user needs the
resources.


Though the integration of Plush and Usher is still in its
early stages, Plush is being used by Shirako users regularly
at Duke University. While Shirako multiplexes resources on
behalf of users, it does not provide any abstractions or
functionality for using the resources once they have been
created. On the other hand, Plush provides abstractions for
managing distributed applications on remote machines, but
provides no support for multiplexing resources. A “resource”
is merely an abstraction in Plush to describe a machine
(physical or virtual) that can host a distributed
application. Resources can be added and removed from the
application’s resource pool, but Plush relies on external
mechanisms (like Shirako and Usher) for the creation and
destruction of resources.


The integration of Shirako and Plush allows users to
seamlessly leverage the functionality of both systems. While
Shirako provides a web interface for creating and destroying
resources, it does not provide an interface for using the
new resources, so Shirako users benefit from the
interactivity provided by the Plush shell. Researchers at
Duke are currently using Plush to orchestrate workflows of
batch tasks and perform data staging for scientific
applications including BLAST [3] on virtual machine clusters
managed by Shirako [14].


Related Work 


The functionality that Plush provides is related to 
work in a variety of areas. With respect to remote job 
execution, there are several tools available that pro- 
vide a subset of the features that Plush supports, in- 
cluding cfengine [9], gexec [10], and vxargs [20]. The 
difference between Plush and these tools is that Plush 
provides more than just remote job execution. Plush 
also supports mechanisms for failure recovery, and au- 
tomatic reconfiguration due to changing conditions. In 
general, the pluggable aspect of Plush allows for the 
use of existing tools for actions like resource discov- 
ery and allocation, which provides more advanced 
functionality than most remote job execution tools. 


From the user’s point of view, the Plush command-line is
similar to distributed shell systems such as GridShell [31]
and GCEShell [22]. These tools provide a user-friendly
language abstraction layer that supports script processing.
Both tools are designed to work in Grid environments. Plush
provides functionality similar to that of GridShell and
GCEShell, but unlike these tools, Plush works in a variety
of environments.


In addition to remote job execution tools and distributed
shells, projects like the PlanetLab Application Manager
(appmanager) [15] and SmartFrog [13] focus specifically on
managing distributed applications. appmanager is a tool for
maintaining long-running services and does not provide
support for short-lived executions. SmartFrog [13] is a
framework for describing, deploying, and controlling
distributed applications. It consists of a collection of
daemons that manage distributed applications and a
description language to describe the applications. Unlike
Plush, SmartFrog is not a turnkey solution, but rather a
framework for building configurable systems. Applications
must adhere to a specific API to take advantage of
SmartFrog’s features.


There are also several commercially available 
products that perform functions similar to Plush. 
Namely, Opsware [24] and Appistry [4] provide soft- 
ware solutions for distributed application manage- 
ment. Opsware System 6 allows customers to visual- 
ize many aspects of their systems, and automates soft- 
ware management of complex, multi-tiered applica- 
tions. The Appistry Enterprise Application Fabric 
strives to deliver application scalability, dependability, 
and manageability in grid computing environments. In 
comparison to Plush, both of these tools focus more 
on enterprise application versioning and package man- 
agement, and less on providing support for interacting 
with experimental distributed systems. 


The Grid community has several application management
projects with goals similar to Plush, including Condor [8]
and GrADS/vGrADS [7]. Condor is a workload management system
for compute-intensive jobs that is designed to deploy and
manage distributed executions. Where Plush is designed to
deploy and manage naturally distributed tasks with resources
spread across several sites, Condor is optimized for
leveraging underutilized cycles in desktop machines within
an organization where each job is parallelizable and
compute-bound. GrADS/vGrADS [7] provides a set of
programming tools and an execution environment for easing
program development in computational grids. GrADS focuses
specifically on applications where resource requirements
change during execution. The task deployment process in
GrADS is similar to Plush. Once the application starts
execution, GrADS maintains resource requirements for
compute-intensive scientific applications through a
stop/migrate/restart cycle. Plush, on the other hand,
supports a far broader range of recovery actions.


Within the realm of workflow management, there 
are tools that provide more advanced functionality 
than Plush. For example, GridFlow [11], Kepler [19], 
and the other tools described in [32] are designed for 
advanced workflow management in Grid environ- 
ments. The main difference between these tools and 
Plush is that they focus solely on workflow manage- 
ment schemes. Thus they are not well suited for man- 
aging applications that do not contain workflows, such 
as long-running services. 


Lastly, the Globus Toolkit [12] is a framework 
for building Grid systems and applications, and is per- 
haps the most widely used software package for Grid 
development. Some components of Globus provide functionality
similar to that of Plush. With respect to our application
specification language, the Globus Resource
Specification Language (RSL) provides an abstract 
language for describing resources that is similar in de- 
sign to our language. The Globus Resource Allocation 
Manager (GRAM) processes requests for resources, 
allocates the resources, and manages active jobs in 
Grid environments, providing much of the same func- 
tionality as Plush does. The biggest difference be- 
tween Plush and Globus is that Plush provides a user- 
friendly shell interface where users directly interact 
with their applications. Globus, on the other hand, is a 
framework, and each application must use the APIs to 
create the desired functionality. In the future, we plan 
to integrate Plush with some of the Globus tools, such 
as GRAM and RSL. In this scenario Plush will act as a 
front-end user interface for the tools available in 
Globus. 


Conclusion 


Plush is an extensible application control infrastructure
designed to meet the demands of a variety of distributed
applications. Plush provides abstractions for resource
discovery and acquisition, software installation, process
execution, and failure recovery in distributed environments.
When an error is detected, Plush has the ability to perform
several application-specific actions, including restarting
the computation, finding a new set of resources, or
attempting to adapt the application to continue execution
and maintain liveness. In addition, Plush provides new
relaxed synchronization primitives that help applications
achieve good throughput even in unpredictable wide-area
conditions where traditional synchronization primitives are
too strict to be effective.


Plush is in daily use by researchers worldwide, and user
feedback has been largely positive. Most users find Plush to
be an “extremely useful tool” that provides a user-friendly
interface to a powerful and adaptable application control
infrastructure. Other users claim that Plush is “flexible
enough to work across many administrative domains (something
that typical scripts do not do).” Further, unlike many
related tools, Plush does not require applications to adhere
to a specific API, making it easy to run distributed
applications in a variety of environments. Our users tell us
that Plush is “fairly easy to get installed and setup on a
new machine. The structure of the application specification
largely makes sense and is easy to modify and adapt.”


Although Plush has been in development for 
over three years now, we still have some features that 
need improvement. One important area for future en- 
hancements is error reporting. Debugging applications 
is inherently difficult in distributed environments. We 
try to make it easier for researchers using Plush to lo- 
cate and diagnose errors, but this is a difficult task. For 
example, one user says that “when things go wrong
with the experiment, it’s often difficult to figure out 
what happened. The debug output occasionally does 
not include enough information to find the source of 
the problem.” We are currently investigating ways to 
allow application-specific error reporting, and
ultimately simplify the task of debugging distributed
applications in volatile environments.


ABSTRACT


We have pioneered the deployment of EverLab, a production-level private PlanetLab system
using high-end clusters spread over Europe. EverLab supports both experimentation and computa- 
tional work, incorporating many of the features found on Grid systems. This paper describes the 
decision process that led us to choose PlanetLab and the challenges that we faced during our im- 
plementation and production phases. We detail the monitoring systems that were deployed on Ev- 
erLab and their impact on our management policies. The paper concludes with suggestions for fu- 
ture work on private PlanetLabs and federated systems. 


Introduction 


Evergrow is a European Commission Sixth 
Framework Integrated Project with around 28 partici- 
pating research organizations spread across Europe 
and the Middle East. The project combines research 
efforts including network measurement [11, 17], dis- 
tributed systems [1], and complex systems research 
[9]. Our researchers span the range from systems de- 
velopers to physicists. At the onset of the project, we 
realized a need to provide computational and experi- 
mental tools for our members. We purchased a set of 
eight IBM HS20 clusters co-located with some of our 
research members. Each cluster has 16 blades, where 
one blade is a storage server and another blade is used 
for configuration management. We allocated support 
expenses to the hosting members in order to provide 
administration, maintenance and support to our users. 


Intentionally, the clusters were spread across
eight European facilities: Aston University (UK);
Université Paris-Sud 11 “Orsay” (UPSXI, France);
Istituto Nazionale per la Fisica della Materia (INFM,
Rome, Italy); Collegium Budapest Egyesület (COLBUD,
Hungary); Tel Aviv University (TAU, Israel);
Otto-von-Guericke-Universität at Magdeburg (UNI MD,
Germany); Université Catholique de Louvain (UCL,
Belgium); and the Swedish Institute of Computer Science
(SICS, Sweden). The clusters were deployed at different
sites so that we could run “real-world” network
experiments using existing wide area network links.


In September of 2004, the team met to decide 
how to integrate the clusters into a shared resource. It 
was agreed to set up a VPN between the clusters with a
master LDAP server for authentication. We intended 
to use IBM’s GPFS file system to make our storage 
available throughout the cluster. 

For various reasons, our agreed-upon approach
was not implemented. The reasons included technical
problems, networking policies and the availability of
local resources. One year later, in September of 2005,
we undertook a survey of our clusters. We found a
very sad state of affairs. Each local administrator had
chosen a different stand-alone implementation for
their cluster. Initially the clusters had RedHat EL2
installed. Some of the local system administrators changed
the operating system to Fedora Core 4, Debian, Ubuntu
or Mosix. Many of the clusters were inaccessible
because the local network policy forbade open access to
“internal” computational resources. At one point, we
had eight different operating systems. No part of the
original plan was universally implemented and hence our
researchers had to request permission from each
administrator both for a login id and for a firewall exemption
so that they could access the nodes.


Figure 1: Geographic location of EverLab sites.


Our first task was to deploy a monitoring system
for all nodes in the eight clusters. We found that Gan- 
glia [10] was easy to install and required only a small 
number of changes to existing network policies. Gan- 
glia gave us our first view into what the clusters were 
doing. The results were disappointing. Many of the 
nodes were idle. 


Having realized that our current approach was 
not working, we started looking at alternatives. One 
option was to force all the administrators to adopt a 
standard platform. This was rejected because each
domain had its own “standard” platform, be it RedHat,
Fedora Core, Debian or BSD. The administrators did not
want to be responsible for an unfamiliar system. The 
other option was to find a standard platform that could 
be administered centrally, relieving the local adminis- 
trators from direct interaction with the installed oper- 
ating system. Two options were suggested: Grid and 
PlanetLab. A comparison of these two approaches can 
be found in [21]. 


We explored using a Grid infrastructure [3]. Grid 
systems are designed for computation and could have 
been deployed across our clusters. Grid environments 
are reasonably well developed. Such a system would 
have provided a unified login, the ability to deploy ap- 
plications across the nodes and a strong monitoring 
and management infrastructure. The problem was that 
the Grid is optimized for computation. Applications 
are automatically deployed to available nodes. A large 
fraction of our researchers wanted to perform experi- 
ments where the location of a process is important. 
When debugging network experiments, it is important 
to be able to run test scripts and tracing programs di- 
rectly on each remote node. The Grid infrastructure 
does not allow this kind of access. Grid computation 
nodes are accessible to users only through the Grid 
management system. 


At the time PlanetLab [18] had been deployed to 
around 500 nodes across the world. PlanetLab sup- 
ports network experimentation across remote distrib- 
uted nodes. The system is centrally managed from 
Princeton University in the United States. PlanetLab
itself was not a complete solution for two reasons.
First, PlanetLab is designed only for network
experimentation, and a significant fraction of Evergrow
researchers needed computational resources. Second,
at the time, there were no production-level Private
PlanetLab installations. In October 2005, we initiated
a European PlanetLab workshop at EPFL, Switzerland
[16]. We found that there was significant interest
both at educational institutions and in industry in
implementing and using Private PlanetLabs to share and
manage remote resources.


In December 2005, we took up the challenge of 
implementing PlanetLab on the Evergrow clusters. We 
called the new system EverLab. The path was
treacherous. At the time, the PlanetLab software was not
designed for ease of installation. It was a moving
target with components being rewritten and upgraded on
a regular and unannounced basis. PlanetLab was
designed for “low end” computers. It ran on a single
processor from the Pentium family with 512 MB of
RAM, a CDROM, 50 GB of local disk and a direct
connect keyboard connected to a central USB bus.
Our cluster blades have dual 3 GHz Xeon processors
with 4 GB of local RAM, 80 GB of disk, no CDROM
and a USB keyboard. It took many months to identify
the problems and build the appropriately modified
kernels and support files.


We succeeded in deploying a Private PlanetLab 
system. The system provided a centrally managed 
platform that was usable for experimentation. We in- 
stalled Ganglia on EverLab so that we could monitor 
both the old and new systems from one platform. We 
also developed a custom-built resource reporting sys- 
tem called EverStats. Together, these tools allow our 
administrators to track system usage and to identify 
network and hardware problems. Ganglia told us 
which machines were in use and EverStats told us who 
used our system and how each node was allocated. 


At this point, EverLab was operational and us- 
able by all Evergrow researchers, but it did not yet 
support High Performance Computation (HPC). To 
this end, we deployed the Condor [19] system from 
the University of Wisconsin. We deployed the
Message Passing Interface (MPICH2) [5] on top of
Condor. With these two additions, EverLab now supports
both computation and networking research.


As far as we know, ours was the first
production-quality Private PlanetLab. Unlike the original
PlanetLab network, which is mainly based on regular PCs,
our network is based on high-end servers
with Gigabit Internet connections.


EverLab runs on a subset of the 112 Evergrow
nodes. It currently includes more than 50 blades in six 
clusters. All Evergrow researchers can create an ac- 
count on EverLab and have quick access to the re- 
sources without negotiating with eight separate admin- 
istrative domains. EverLab is monitored 24x7, and 
problems are quickly identified. The EverLab admin- 
istrators can handle system level issues. Local admin- 
istrators respond to hardware related problems. 


The rest of this paper describes the challenges 
and solutions that we encountered during this journey. 
We detail the value of Ganglia and EverStats to our 
administration efforts. There are still many
opportunities to improve and extend EverLab, some of which are
described in the Future Work section. Finally, we
conclude with our lessons and opinions about the use of
EverLab-type systems for new research projects.


PlanetLab 


EverLab is based on PlanetLab version 3.2 [18].
This section describes the PlanetLab implementation.
Overall, we found PlanetLab to be a very stable
platform once the installation process and initial settings
were completed.


Figure 2: Schematic diagram of the PlanetLab network
components. (Diagram labels: PlanetLab PLC with web
server, API, RPM and PostgreSQL DB; slices.)


PlanetLab is a centrally managed collection of 
distributed computers which are called nodes. The 
system is designed to be used on a publicly accessible 
network where all nodes can at a minimum access the 
central management node. The central management 
node or PlanetLab Central (PLC) supplies three
functions: a database for storing system state, a web interface
for management and an RPM [4] repository for
updating the remote nodes. The web interface provides a
human-readable interface and an XML-RPC interface
called PLCAPI for internal use. Remote nodes
communicate with the PLC through HTTPS and PLCAPI
calls.
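

As an illustration of the XML-RPC interface described above, the following minimal sketch builds (but does not send) the kind of request body a script might post to the PLC over HTTPS. The method name GetNodes, the layout of the authentication structure, the URL and the host names are assumptions made for illustration; the method names offered by a given PLCAPI version may differ.

```python
import xmlrpc.client

# Hypothetical endpoint, for illustration only (not taken from the paper).
PLC_URL = "https://plc.example.org/PLCAPI/"


def build_request(method, auth, *params):
    """Serialize an XML-RPC call whose first argument is an auth struct."""
    return xmlrpc.client.dumps((auth,) + tuple(params), methodname=method)


def parse_request(body):
    """Decode an XML-RPC request body back into (method, params)."""
    params, method = xmlrpc.client.loads(body)
    return method, params


# Assumed shape of the authentication structure.
auth = {
    "AuthMethod": "password",
    "Username": "admin@example.org",
    "AuthString": "secret",
}
body = build_request("GetNodes", auth, {"hostname": "node1.example.org"})

# To actually issue the call, a client could use:
#   xmlrpc.client.ServerProxy(PLC_URL).GetNodes(auth, {...})
```

Serializing the call locally keeps the sketch runnable without a PLC; a real client would let ServerProxy handle both the encoding and the HTTPS transport.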


At the time of our initial efforts, the PlanetLab
Central node ran on Fedora Core 2 (FC2). The PLC
used a PostgreSQL database and an Apache web
server. Most of PlanetLab was implemented in a mixture
of shell and Python. The use of common open-source
components was a significant factor in our decision to 
implement PlanetLab. We felt that we could under- 
stand, maintain and modify any or all of the compo- 
nents as needed. 


Each PlanetLab Version 3 node consists of a
modified FC2 system. All user activity is performed using
virtualization technology implemented by the vserver
[8] kernel extension. Node installation is intentionally
kept as simple and minimal as possible both to reduce
complexity and to increase security. PlanetLab’s
virtualization unit is called a slice. Each slice is a minimal FC2
installation. A user logged into a slice has slice-level
superuser privileges through the sudo command. Slices
provide compartmentalization between users and system
components, thus reducing or eliminating the possibility
of one user modifying or removing a file or component
necessary to another user or process.


PlanetLab Security 


PlanetLab was designed from the outset as a plat- 
form for network experimentation. PlanetLab nodes need 
free access to and from the Internet in order to provide 
the broadest possible research opportunities and to limit 
unexpected network interactions caused by firewalls or 
local network policies. This focus impacted many of the 
installation, administration and security aspects of Plan- 
etLab. 


PlanetLab utilizes asymmetric encryption keys to 
create secure authenticated communication channels. 
These keys are used to identify nodes, servers, and 
users within the system. There are unique keys for 
run-time and debug mode operations. 


All nodes are assumed to be at risk. Even with the 
strong compartmentalization provided by the vserver 
slices, in principle an attacker could enter the root
domain and take over the machine. To minimize this
exposure and to provide a recovery mechanism from a
possible penetration, PlanetLab initially boots from a
CDROM. The CDROM contacts the PLC and can
either enter a debug mode, boot to the existing disk-based
kernel, or re-install the node.


Figure 3: PlanetLab boot process. (1) EverLab node
boots from CD-ROM. (2) Node gets certificate
and identity from floppy drive. (3) Authentication
process is done via the PLC. (4) Local files
are updated from PLC. (5) Node bootstraps into
VServer kernel.


The PlanetLab kernel includes a secure Ping Of 
Death (POD) implementation which allows the PLC to 
cause the kernel to reboot given an encrypted secret 
that only the PLC could have produced. 
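

The idea of a reboot request that only the PLC can produce can be sketched with a keyed hash. This is not PlanetLab's actual POD packet format, which the text does not specify: the shared key, the payload layout and the function names below are assumptions, and HMAC-SHA256 stands in for whatever scheme the kernel really uses.

```python
import hashlib
import hmac
import struct

PAYLOAD_LEN = 12  # 4-byte node id + 8-byte timestamp
TAG_LEN = 32      # SHA-256 digest size


def make_pod(key: bytes, node_id: int, timestamp: int) -> bytes:
    """Build a reboot request: payload followed by its HMAC tag."""
    payload = struct.pack("!IQ", node_id, timestamp)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag


def verify_pod(key: bytes, packet: bytes, node_id: int,
               now: int, max_age: int = 30) -> bool:
    """Accept only a fresh, correctly tagged request for this node."""
    if len(packet) != PAYLOAD_LEN + TAG_LEN:
        return False
    payload, tag = packet[:PAYLOAD_LEN], packet[PAYLOAD_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return False
    packet_node, sent_at = struct.unpack("!IQ", payload)
    # The timestamp check limits replay of an old, captured request.
    return packet_node == node_id and 0 <= now - sent_at <= max_age
```

Including the node identifier and a timestamp in the tagged payload means a captured request cannot be replayed later or against a different node.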


If a node is suspected of having been
compromised, a PlanetLab administrator can use the
Ping Of Death to force the node to reboot from
CDROM into debug mode. At this point, the
administrator can log into the node using a special debug
mode SSH key. The administrator can mount the local
disk, examine the files and determine if the machine is
worth saving. At any point, the administrator can set
the node’s status to “reinstall”. On the next reboot, the
CDROM-based kernel will wipe the disks clean and
install a clean system from scratch.
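

The boot-time choice described above can be summarized as a small dispatch on the state that the PLC records for a node. The state names and action descriptions below are hypothetical labels; the text only says that the boot CD can enter debug mode, boot the disk-based kernel, or reinstall the node.

```python
from enum import Enum


class BootState(Enum):
    """Node states the PLC might report to the boot CD (assumed names)."""
    BOOT = "boot"
    DEBUG = "debug"
    REINSTALL = "reinstall"


# What the boot CD does for each state, as described in the text.
ACTIONS = {
    BootState.BOOT: "boot the existing disk-based kernel",
    BootState.DEBUG: "stay on the CD and wait for a debug SSH login",
    BootState.REINSTALL: "wipe the disks and install a clean system",
}


def next_action(state_from_plc: str) -> str:
    """Map the state string received from the PLC to a boot action."""
    try:
        state = BootState(state_from_plc)
    except ValueError:
        # Unknown state: fall back to the safest option, debug mode.
        state = BootState.DEBUG
    return ACTIONS[state]
```

Falling back to debug mode on an unrecognized state is a design choice of this sketch, not something the text states.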


In the 18 months that we have run PlanetLab 
nodes on the Internet, we have never had a known pene- 
tration. We have used both the POD and the Reinstall 
option to recover from hardware and software errors.

NetFlow: Security Monitoring and Logging

PlanetLab includes a package called NetFlow on 
each node. The PlanetLab NetFlow component is
based on the Netfilter [12] ulogd package. This pack- 
age tracks all network flows, i.e., communications be- 
tween this node and all other nodes. The data is avail- 
able over HTTP from port 80 on each node. The flow 
traces are very useful in determining which slice was 
responsible for communication to a given node. This 
can be helpful when debugging an application or ex- 
periment. It can also be used when we suspect that the 
node has been compromised either by an external par- 
ty or a rogue experiment. 
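

Attributing traffic to a slice, as described above, amounts to filtering flow records by destination and grouping them by slice. The record layout below (slice name, destination address, destination port, byte count) is a hypothetical simplification; the text does not describe the format that the NetFlow component actually exports.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Flow(NamedTuple):
    """One flow record (assumed fields, for illustration only)."""
    slice_name: str
    dst_ip: str
    dst_port: int
    n_bytes: int


def slices_talking_to(flows: Iterable[Flow],
                      dst_ip: str) -> List[Tuple[str, int]]:
    """Return (slice, total bytes) pairs for one destination, largest first."""
    totals = defaultdict(int)
    for flow in flows:
        if flow.dst_ip == dst_ip:
            totals[flow.slice_name] += flow.n_bytes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Example records using documentation-only addresses.
FLOWS = [
    Flow("slice_a", "192.0.2.10", 80, 1200),
    Flow("slice_b", "192.0.2.10", 80, 300),
    Flow("slice_a", "192.0.2.10", 443, 500),
    Flow("slice_c", "192.0.2.99", 22, 4000),
]
```

With records like these, a complaint about traffic to one address can be traced to the responsible slice without inspecting each slice by hand.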


Installation Issues 


PL_BOX 

The first efforts to deploy Private PlanetLabs
were through a package called “pl_box”: PlanetLab
in a Box. The package consists of a set of scripts
which can download and install all necessary
PlanetLab components. The local machine is installed as a
PlanetLab Central (PLC) and separate scripts are
provided to create deployment CDROMs and kernels.
The PLC installation copies the necessary RPM files
to a local directory for use when installing PlanetLab
nodes. These RPMs include the FC2 package as well
as separate PlanetLab packages.


The pl_box package generates all public and pri- 
vate keys, installs the databases and web applications 
and creates the necessary cron jobs to keep PlanetLab 
running. 

For our implementation needs, there were two 
major drawbacks to the pl_box system. First, the in- 
stalled system is a copy of the PlanetLab system. All 
the web pages, documents and embedded links point 
back to the original PlanetLab system instead of the 
newly installed Private PlanetLab. Secondly, there is 
no upgrade path for pl_box. Changes made to the orig- 
inal PlanetLab system must be manually imported into 
the private pl_box. At the time, there was no mecha- 
nism for change notification and so the public Planet- 
Lab and the private pl_box system were guaranteed to 
diverge. 

Even with these issues, we have found pl_box to 
be sufficiently stable for our needs. We have had no 
serious issues since our deployment more than 18 
months ago. The PlanetLab project has since replaced
pl_box with a new system called MyPLC. The
MyPLC system offers the ability to customize the user
interface for the private installation. It provides an
upgrade mechanism through the use of standard RPM
source and binary packages available from the
PlanetLab development team. There is no formal upgrade
path from pl_box to MyPLC and so we will need to
re-install our entire system and re-implement our
extensions. Nonetheless, we believe that MyPLC
represents the future of private PlanetLabs and we plan
to upgrade to the new system sometime this year.


Our first challenge was to install pl_box. We 
chose to use a Fedora Core 4 platform for this pur- 
pose. At the time, Thierry Parmentelat at Inria in 
France showed that it was possible to run a PLC on 
FC4. We decided to install on FC4, given that it was a 
fresher, more secure release than the default FC2. The 
installation and production challenges revolved around 
changes to core packages such as PHP that pl_box re- 
quired. Debugging these differences gave us our first 
understanding of the core functionality. 


Once the PLC was installed, we moved to in- 
stalling new nodes. We started with two local ma- 
chines that had already been PlanetLab nodes. We had 
no trouble installing these two machines and were 
now ready to deploy to the clusters. It was here that 
our real problems began. 

Cluster Ownership 

We spent many months prototyping and
experimenting with PlanetLab to determine its appropriateness
for our installations. At the Evergrow general assembly
meeting in December 2005, we presented our results
and were given the green light to deploy EverLab on the
group’s clusters. We then went to each local site
administrator and asked for access to deploy the new system.
We were met with three types of responses.


Some administrators welcomed us with open 
arms. We were going to reduce their overhead by
managing all of the cluster’s software and user issues. These
systems were converted to EverLab as soon as the ex- 
isting researchers had finished their ongoing experi- 
ments. 


Some site administrators had integrated one or 
more of their cluster’s nodes into research workflows. 
These nodes became dedicated to that project and 
were unavailable to EverLab. At these sites, EverLab 
was installed on most but not all of the nodes. 


Finally, one cluster never made the transition to 
EverLab. This group had installed a load balancing 
version of Linux and were able to keep all of the
cluster’s nodes fully loaded more than 90% of the time. It
was decided that there was little benefit to be had by 
moving these nodes to EverLab. 


Network Politics 


The Evergrow nodes were deployed to eight
universities across Europe. Each university had and has
its own network policies. Some of our host
universities were already hosting PlanetLab nodes.


We had to fight the network battle at each sepa- 
rate location. PlanetLab minimally requires that: 
1. Each node has a public DNS entry. 
2. Each node has unhindered access to the PLC. 
3. The PLC can send packets to each node for the 
Ping of Death feature. 


In addition, we hoped that all machines would 
have unhindered access to and from the Internet. 


We negotiated with the local system
administrators and they negotiated with their local network
administrators. In most cases, we were able to get the
nodes physically located on the public Internet. The
negotiations sometimes required the signatures of
university officials or worse, university security officers.
All told, it took many months to simply gain access to 
the remote clusters. 


Hardware 


The closest cluster to our developers was at Tel
Aviv University (TAU). Unfortunately, TAU has very
restrictive access policies and to this day provides
only restricted access to its EverLab nodes. In contrast,
the Swedish Institute of Computer Science (SICS)
had nodes directly connected to the Internet and was
eager to help with our new system. We decided to start
testing with the SICS cluster.


Our first challenge was to replace the CDROM 
based bootstrap process used in PlanetLab nodes. We 
wanted to maintain the bootstrap ability, but our 
blades did not have a dedicated CDROM drive or 
USB drive. We needed to have our nodes boot from 
the network. We discarded the option of booting from 
the PLC because of the significant network delays and 
limited bandwidth between our clusters in Europe and 
the PLC in Jerusalem. This left the option of booting 
from a local node. 


Each cluster included a management node that 
was originally designed to provide network boot over 
DHCP and PXEBoot [7]. PXEBoot usually downloads
a kernel to the target node. The node boots the kernel 
in diskless mode and uses NFS to mount the root par- 
tition from the boot server. We wanted our root parti- 
tion to be read-only and to be unique for each node. 
We could have created a separate read-only directory 
on the management node and mounted it on each boot 
node. Instead, we chose to incorporate the complete
root partition into an initrd file. This file is then down- 
loaded to each node as it boots. The EverLab initrd file 
contains the complete content of the Boot CD. In a 
standard system, each PlanetLab Boot CD references a 
diskette which contains the private keys for that ma- 
chine. PlanetLab allows an administrator to put these 
keys into the CDROM itself, thus having a custom 
CDROM for each node. 


Our nodes do not have a diskette, thus we needed
a custom initrd for each node. We created a generic
initrd and wrote a script that copies the generic initrd to a
custom file and installs the private keys. This script is
then run once for each target node on the local cluster.
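

One way to hand each node its own initrd, assuming the boot server runs PXELINUX (the text cites PXEBoot [7] without naming the boot loader), is a per-node configuration file named after the node's MAC address. The kernel path, initrd paths and MAC address below are hypothetical, and this sketch is not the authors' script.

```python
def pxelinux_filename(mac: str) -> str:
    """Per-node config name under pxelinux.cfg/: '01-' plus the dashed MAC."""
    octets = mac.lower().replace("-", ":").split(":")
    if len(octets) != 6 or any(len(octet) != 2 for octet in octets):
        raise ValueError(f"not a MAC address: {mac!r}")
    return "01-" + "-".join(octets)


def pxelinux_config(kernel: str, initrd: str) -> str:
    """Minimal PXELINUX entry that boots one kernel with one initrd."""
    lines = [
        "DEFAULT everlab",
        "LABEL everlab",
        f"  KERNEL {kernel}",
        f"  APPEND initrd={initrd}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical node: the custom initrd already holds this node's keys.
name = pxelinux_filename("00:11:22:AA:BB:CC")
config = pxelinux_config("everlab/kernel", "everlab/initrd-node1.img")
```

A deployment script could write `config` to `pxelinux.cfg/<name>` on the management node once per blade, so that every node downloads the shared kernel together with its own key-bearing initrd.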


Figure 4: EverLab PXE boot solution. (1) Boot kernel
and initrd downloaded from local PXE boot server
via PXEBoot. (2,3) Authentication process is
done via the PLC and local files are updated.
Finally, node bootstraps into VServer kernel.


The blades have two Ethernet Network Interface 
Cards (NICs). We chose to use one NIC for the
bootstrap process and the second for the public Internet.
We use the private NIC only during the boot process
and leave it unconfigured during normal EverLab
operations. Slices are unable to configure the NICs. The
main benefit is that EverLab users have no way to
attack or even to see the boot server. We believe that this
increases the probability that the boot image remains 
intact. Our approach is not as good as a read-only 
CDROM, but we feel that it strikes the right balance 
between security and complexity for our system. 


Once we had the PXEboot and initrd process 
working, we were able to boot the default PlanetLab 
kernel. Unfortunately, our blades were newer than the 
supported PlanetLab nodes. The running nodes had no 
network drivers and no keyboard controller. Each 
blade has a USB keyboard, but the default PlanetLab 
nodes did not install the appropriate drivers. Similarly, 
the blades used a network card that was not available 
when FC2 was first released. We were left with a node 
that was clearly running something, but that was deaf 
and blind. 


Through trial and error we were able to identify 
the appropriate drivers and configuration files and to 
rebuild our initrd files. This process took the better part 
of a month, but we were finally able to debug and boot 
our nodes. 


Management Challenges 


Our intention was to develop a system that could 
be managed remotely. The PlanetLab system provided 
most of that functionality. PlanetLab defines four 
types of capabilities: 

1. Administrators: These users can perform all
PlanetLab operations including Ping of Death
and re-install. They can enable or disable
features and capabilities for other users.

2. Principal Investigators: These users are
responsible for the use of a set of nodes. They can
enable or disable access for their students and
can create slices.

3. Users: These users can deploy a slice to one or
more nodes and can log into those nodes.

4. Techs: These users can administer their local site,
performing admin-like functions only on those
nodes.


We use the standard PlanetLab definitions, but
give each of the local system administrators both
Principal Investigator and Tech capabilities. These
system administrators are then asked to enable
accounts for researchers in their institutions. Researchers
who do not have a local cluster administrator are
managed by the central EverLab administration team.


Hyperthreading Performance Issues 


During the decision process, our computation-based
researchers were concerned that the PlanetLab
infrastructure would require a significant fraction of 
each node’s CPU cycles. We were able to show that on 
our hardware, the difference between a program run- 
ning on a stock kernel and one running in a slice under 
PlanetLab was less than 3%. This experiment was not 
particularly scientific, but it was sufficient to assuage 
the fears of the HPC researchers. 


Our nodes came with two Intel Xeon 3.06 GHz
processors that support Intel’s Hyper-Threading
Technology (HTT) [22]. In theory, Hyper-Threading
Technology should provide a performance boost of up to
30%. We found that this was true only on heavily 
loaded server systems running a number of CPU-in- 
tensive processes that is larger than the number of in- 
stalled physical CPUs. In our workloads, we typically 
have only as many compute tasks as CPUs. 


The Linux kernel views each hyper-threaded pro- 
cessor as two virtual processors. When a task is 
runnable, it is assigned to one of the virtual processors. 
Ideally, since we have two physical processors per 
node, the kernel should schedule each task to a different 
physical processor. Unfortunately, we found that in
many instances, the kernel scheduled both CPU-intensive
tasks to virtual processors on the same physical
processor. The resulting cache misses and contention
significantly reduced our systems’ performance.
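

The placement problem can be made concrete by mapping logical CPUs to physical packages. The sketch below parses text in the style of Linux's /proc/cpuinfo and picks one logical CPU per physical processor; the sample data and function names are invented for illustration, and explicit pinning is an alternative approach, not what was done on EverLab.

```python
from typing import Dict, List

# Invented sample: two hyper-threaded packages, four logical CPUs.
SAMPLE = "\n\n".join(
    f"processor\t: {cpu}\nphysical id\t: {package}"
    for cpu, package in [(0, 0), (1, 0), (2, 3), (3, 3)]
)


def logical_to_package(cpuinfo: str) -> Dict[int, int]:
    """Map each logical CPU number to its physical package id."""
    mapping = {}
    for block in cpuinfo.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "processor" in fields and "physical id" in fields:
            mapping[int(fields["processor"])] = int(fields["physical id"])
    return mapping


def one_cpu_per_package(mapping: Dict[int, int]) -> List[int]:
    """Choose the lowest-numbered logical CPU on each physical package."""
    chosen = {}
    for cpu in sorted(mapping):
        chosen.setdefault(mapping[cpu], cpu)
    return sorted(chosen.values())


# On Linux, a task could then be pinned with
# os.sched_setaffinity(pid, {cpu}) for each chosen cpu.
```

Pinning two compute tasks to the CPUs returned by `one_cpu_per_package` keeps them on different physical processors even with Hyper-Threading enabled.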


In light of this finding, we have turned off
Hyper-Threading on most of our nodes.

Stability

IBM describes the HS20 cluster as: “Powerful
2-way Intel processor-based blade server delivers
uncompromising performance for your mission-critical
applications.” [6] We purchased this equipment in
2003/2004 from IBM because we valued IBM’s repu- 
tation for reliability and service. We have since had 
cause to regret this decision. 

Each of our blades came mounted with two
Toshiba MK4019GAXB [20] 40 GB disk drives.
These drives are 2.5-inch hard disk drives of the type
found in many laptops. IBM probably chose to use
2.5-inch drives because of the size restrictions imposed
when fitting two drives on each blade.


In the interest of reducing disk management, 
each PlanetLab node mounts its own local disks as a 
single large logical volume. This construct enables the 
system to allocate disk space without concern for the 
size of each partition or disk drive. The drawback is 
that if any of the drives fails, the whole partition is
corrupted and lost. Recall that in PlanetLab, it is easy to
re-install a node. The loss of any one node is pretty 
trivial and the resources can be easily recovered (al- 
though the data on that node is lost). 


At installation time, the Evergrow clusters had a 
total of 244 MK4019GAXB drives. In the first three 
years of ownership, we have replaced more than 50 of 
these drives. By far, the most common problem and 
the largest management headache has been the failure 
of disk drives. We have not identified any common 
cause behind these failures. Some of our clusters are 
in very professional data centers with significant cool- 
ing capacity and managed power. Other clusters are in 
less professional locations with limited cooling and 
whatever power is available from the wall socket. Lo- 
cation does not seem to be a factor in the failures. We 
can only conjecture that these drives were defective in 
one way or another. There have been no failures with 
any of the non-MK4019 replacement drives. 
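

A rough calculation shows why these failures dominated our workload. The sketch below uses the figures quoted above (244 drives, at least 50 replaced in three years) and assumes, purely for illustration, that failures are independent, that the rate is constant over time, and that each node spans its logical volume across two drives.

```python
drives_installed = 244
drives_replaced = 50   # the text says "more than 50", so a lower bound
years = 3
drives_per_node = 2    # both drives are joined into one logical volume

# Fraction of drives that failed within the three-year window.
p_drive = drives_replaced / drives_installed

# Equivalent constant annual failure probability per drive.
p_annual = 1 - (1 - p_drive) ** (1 / years)

# A node loses its whole volume if either of its drives fails.
p_node = 1 - (1 - p_drive) ** drives_per_node

print(f"three-year drive failure fraction: {p_drive:.1%}")
print(f"annualized drive failure rate:     {p_annual:.1%}")
print(f"three-year node volume loss:       {p_node:.1%}")
```

Under these assumptions, roughly one drive in five failed within three years, and because a single failure destroys the spanned volume, the chance that a given node lost its data at least once is noticeably higher than the per-drive figure.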


Monitoring 


Monitoring of resources is an important part of 
any research project. At a minimal level, monitoring 
tools identify software, systems and hardware that 
may not be operating as expected. One level up, moni- 
toring provides a list of assets that administrators and 
users can reference to identify and find available
resources. At the management level, the project
coordinators can watch the monitoring systems to identify
users that are utilizing the project’s resources. This data
also delineates those registered users who are not
using the system.

Ganglia 

We began to wonder about utilization of our new 
system about six months after our partners received 
their cluster hardware. We knew that each partner had 
deployed the systems, but we had no idea if the clus- 
ters were in use or even if they were operational. Our 
very first effort was to implement a centralized moni- 
toring system using the Ganglia [10] software pack- 
age. Ganglia provided at-a-glance status for each clus- 
ter and each node. 


Ganglia implements both push and pull
communications. Each monitored node runs a Ganglia daemon
called gmond which intermittently collects
information about the local system state. Each gmond daemon
sends a regular update to a gmetad collector daemon.
The gmetad daemons maintain a list of all clients that
sent them data along with the details for those clients. We
have one gmetad daemon for each cluster and a
separate one for the EverLab system. In our
implementation, a central gmetad daemon running on our main
server polls each of the gmetad daemons on a regular
basis and collects a snapshot of that daemon’s stored
status. The collected data is then stored in RRD [13]
databases for graphing and presentation.


We keep a web browser open to our local Gan- 
glia monitor. At a glance, we can see which clusters 
are operational, busy, or down. Ganglia provides sum- 
mary statistics of Total, Up and Down hosts which 
quickly reflect the overall system health. 
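

The Total, Up and Down summary can be reproduced from the XML state that the Ganglia daemons exchange. The sketch below parses a hand-written sample in the general shape of that XML; the sample data is invented, and the rule that a host counts as down when its TN value (seconds since the last report) exceeds four times TMAX is an assumption of this sketch rather than something stated in the text.

```python
import xml.etree.ElementTree as ET

# Invented sample in the general shape of Ganglia's XML output.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.0.0" SOURCE="gmond">
<CLUSTER NAME="example-cluster">
<HOST NAME="node1" TN="10" TMAX="20"><METRIC NAME="load_one" VAL="0.50"/></HOST>
<HOST NAME="node2" TN="15" TMAX="20"><METRIC NAME="load_one" VAL="1.25"/></HOST>
<HOST NAME="node3" TN="500" TMAX="20"><METRIC NAME="load_one" VAL="0.00"/></HOST>
</CLUSTER>
</GANGLIA_XML>"""


def summarize(xml_text: str, down_factor: int = 4) -> dict:
    """Count up and down hosts and sum the one-minute load of live hosts."""
    root = ET.fromstring(xml_text)
    up = down = 0
    load_one = 0.0
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        if tmax and tn > down_factor * tmax:
            down += 1          # no recent report: treat the host as down
            continue
        up += 1
        for metric in host.findall("METRIC"):
            if metric.get("NAME") == "load_one":
                load_one += float(metric.get("VAL"))
    return {"total": up + down, "up": up, "down": down, "load_one": load_one}
```

Comparing the summed one-minute load with the number of CPUs gives the same at-a-glance utilization picture as the load graphs.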


Figure 5: May 2006 to May 2007 EverLab One
Minute Load Average. (Graph title: EverLab Cluster
Load last year; vertical axis: Load/Procs.)


The main Ganglia load graphs display the 1 
minute load, the number of nodes, the number of 
CPUs and the number of running processes. An opti- 
mally utilized system would have one process for each 
CPU. The EverLab One Minute Load Average graph
shows the 1 minute load on the EverLab system be- 
tween May 2006 and May 2007. The line at the top of 
the graph shows the total number of CPUs, some of 
which are actually virtual hyperthreaded CPUs. The 
second line shows the number of reported nodes and 
the third line shows the number of running processes. 


As can be seen, the number of nodes changes 
over time. The majority of these outages are related to 
power and cooling problems in the remote data cen- 
ters. The graph also shows the growth in EverLab us- 
age. Initial use was minimal until after the EverLab 
Workshop in June of 2006. From that point until 
March 2007, we averaged about one process for every 
two nodes. Toward the end of this period, we saw us- 
age increase to approximately one process per node. 


Ganglia has been very useful in Evergrow and 
EverLab, but was not particularly successful in the 
PlanetLab environment. In 2001, one of the Ganglia 
developers joined the PlanetLab project and began de- 
ploying Ganglia over PlanetLab. Over time, the Gan- 
glia installation was removed and forgotten. It seems 
that Ganglia did not provide enough benefit to the 
PlanetLab team to warrant its maintenance costs. 


We believe Ganglia was appropriate for EverLab but
inappropriate for PlanetLab for the following reasons:

1. EverLab is naturally organized by clusters of
nodes. The gmond daemons communicate over the local
network to their gmetad parents. In PlanetLab, there
is no natural structure and hence each gmond must
send remote messages to a centralized gmetad. This
increases the rate of lost messages and requires
significant bandwidth at the central node.

2. PlanetLab nodes have traditionally been heavily
utilized. It is very rare to find a PlanetLab node
that has no running processes, so Ganglia simply
shows that these nodes are busy. Ganglia provides
dozens of detailed metrics, but does not
differentiate between slices. From a high-level
perspective, we use Ganglia only to show that a node
is busy or down. The detailed metrics have not been
useful in our environment.

3. PlanetLab has more than 600 nodes. Ganglia does not
scale particularly well beyond a few hundred nodes.
For example, in Evergrow, the Ganglia web front end
periodically queries the main gmetad daemon for the
system state. The data is returned as a single large
XML document at least 220K bytes in length. The
equivalent file for PlanetLab would be more than
1 MB. Sending this data over the wire every 60
seconds is inefficient, particularly since most of
the data has not changed.
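
To put the cost described in item 3 in concrete terms, the
sketch below converts the quoted document sizes into an
average bit rate for one refresh every 60 seconds. It assumes
decimal units for the sizes and counts only the XML payload,
ignoring protocol overhead; the function and constant names
are illustrative.

```python
def average_bitrate_kbps(payload_bytes, interval_seconds):
    """Average bit rate, in kilobits per second, of one payload per interval."""
    return payload_bytes * 8 / interval_seconds / 1000


# Sizes quoted in the text, read here as decimal units (an assumption).
EVERGROW_BYTES = 220 * 1000      # "at least 220K bytes"
PLANETLAB_BYTES = 1000 * 1000    # "more than 1 MB"
INTERVAL_SECONDS = 60

if __name__ == "__main__":
    for label, size in (("Evergrow", EVERGROW_BYTES),
                        ("PlanetLab", PLANETLAB_BYTES)):
        rate = average_bitrate_kbps(size, INTERVAL_SECONDS)
        print(f"{label}: {rate:.1f} kbit/s per polling front end")
```

Under these assumptions the Evergrow document costs roughly
29 kbit/s per polling front end and a PlanetLab-sized document
roughly 133 kbit/s. Both figures are lower bounds, because the
sizes given in the text are themselves lower bounds.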


We are very pleased with Ganglia for an installation of
our size and would be likely to choose it again. For us,
the benefits of a simple monitoring system outweigh the
network traffic overhead. Ganglia is, however, a rough
tool. The instantaneous data is frequently incomplete
because of the way that the gmetad daemon collects and
updates its internal data structures. Improvements in
this area would make Ganglia more useful and reliable.


EverStats 


Ganglia provided our system with cluster and 
host level monitoring. We could identify node status 
and utilization. But our project administrators wanted 
more. They asked us about the users and their projects. 
Which projects were deployed on EverLab? Were the 
projects computationally focused or more experimen- 
tal? How many unique projects were actually using 
the system? To address this issue, we developed the 
EverStats usage monitoring system. 


Our first challenge was to collect data from the
EverLab nodes. We knew that the CoMon [15] project had
developed a tool on PlanetLab to monitor node usage, so
we borrowed their underlying monitoring tool, called
slicestat [14]. The slicestat daemon runs on each
PlanetLab node and, like the gmond daemon, collects
performance metrics. Slicestat has the added benefit
that it understands the PlanetLab virtual server
architecture and can report data according to each
slice’s activity.

The CoMon project polls each slicestat daemon and
provides current CPU data along with 1 minute and 15
minute network statistics. In many ways CoMon overlaps
with Ganglia as a monitoring tool.


Our interest was not in the short-term node and slice
status, but in the historical usage patterns. We
installed slicestat on all EverLab nodes and then built
a custom tool called EverStats to poll the daemons and
store the results in a long-term database. EverStats
provides summary reports on usage for the past week,
month and year and allows administrators to drill down
from either a slice or a node view.


In order to minimize network traffic, EverStats polls
the slicestat daemons once every five minutes. We keep
the raw data for 24 hours and then summarize it as
daily data in our database. Short-term activities such
as running a program for 30 seconds are unlikely to be
visible in EverStats. On the other hand, computations
or experiments that run for an hour or more will
certainly be reported.
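
The daily summarization step can be sketched as follows. This
is a minimal illustration rather than the EverStats
implementation: the sample layout (timestamp, slice, node, CPU
percentage), the choice of a simple average and the function
name are all assumptions.

```python
from collections import defaultdict
from datetime import datetime


def summarize_daily(samples):
    """Collapse raw samples into one averaged record per (day, slice, node).

    Each sample is a tuple (timestamp, slice_name, node_name, cpu_percent).
    """
    buckets = defaultdict(list)
    for timestamp, slice_name, node_name, cpu_percent in samples:
        buckets[(timestamp.date(), slice_name, node_name)].append(cpu_percent)
    return {
        key: {"samples": len(values),
              "avg_cpu_percent": sum(values) / len(values)}
        for key, values in buckets.items()
    }


if __name__ == "__main__":
    # Hypothetical five-minute samples for one slice on one node.
    raw = [
        (datetime(2007, 4, 1, 0, 0), "demo_slice", "node1", 90.0),
        (datetime(2007, 4, 1, 0, 5), "demo_slice", "node1", 110.0),
        (datetime(2007, 4, 2, 0, 0), "demo_slice", "node1", 50.0),
    ]
    for key, record in sorted(summarize_daily(raw).items()):
        print(key, record)
```

With one sample every five minutes, a full day contributes at
most 288 samples per slice and node, and a job that starts and
finishes between two polls leaves no sample at all.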


There are more than 65 unique projects regis- 
tered in the EverLab database. New users tend to cre- 
ate a test slice to familiarize themselves with the sys- 
tem. These users then move to a project slice that is 
shared by their research team. 


The EverLab slice group report presents the cumulative
data for all slices in each defined group. The sample
EverStats Slice Group Report (Table 1) shows a
representative report from April 2007. The System group
represents the basic EverLab services. These services
tend to be active for short periods of time, but are
visible when a node is idle.


The Condor slice group represents the distributed 
Condor instantiations on each of our nodes. Condor is 
currently under very light load and hence the values 
represent a sort of steady state overhead similar to the 
System group. 

The rest of the groups represent research origi- 
nating at their respective universities. The “Others” 
group is a catch-all for research from one of our non- 
cluster partners. There are currently seven slices in the 
“Others” group. The university research groups have 
between two and five slices each. 

As can be seen from the report, most research projects
do not use the full power of EverLab. As the number of
nodes in an experiment increases, so too does the
complexity of deployment, debugging and monitoring.
Projects tend to use only as many nodes as are
necessary to produce a reasonable academic paper.

Of the six project groups, most slices are CPU bound,
with CPU loads running between 88.44% and 261%. A load
over 100% indicates that there is more than one process
running concurrently in these slices. Experienced High
Performance Computing researchers attempt to allocate
exactly one process to each processor in order to
decrease contention for CPU cycles.


The “Others” group provides a good example of slices
involved in networking experiments. These slices are
generating on average 45 Kbps of incoming and outgoing
traffic over the life of the experiments. As can be
seen, the System and Condor groups are minimal users of
both CPU and network resources.


EverStats is a useful tool in its current form.
Potential extensions include the graphing of trends and
improvements in the sampling technology. Graphs and
trend analysis would be helpful for presentations and
for tracking the natural growth and decline of project
activities.


The current sampling technology serially queries 
each of the nodes. In addition to being inefficient, this 
approach takes a significant fraction of the five minute 
query period. While acceptable for EverLab, the serial 
query mechanism takes more than thirty minutes on 
the full PlanetLab for each query cycle. 
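
One possible improvement, which is not what EverStats
currently does, is to query several nodes concurrently. The
sketch below contrasts a serial loop with a thread pool from
the Python standard library. The fetch function is a
hypothetical stand-in that only simulates network latency, and
the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_node_stats(node):
    """Stand-in for one network query to a node's slicestat daemon.

    Hypothetical: a real poller would open a connection to the node here.
    The sleep only simulates network latency.
    """
    time.sleep(0.01)
    return {"node": node, "ok": True}


def poll_serial(nodes):
    """Query nodes one after another; total time grows with len(nodes)."""
    return [fetch_node_stats(node) for node in nodes]


def poll_parallel(nodes, max_workers=20):
    """Query up to max_workers nodes at once using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input nodes.
        return list(pool.map(fetch_node_stats, nodes))


if __name__ == "__main__":
    nodes = [f"node{i}" for i in range(100)]
    for poll in (poll_serial, poll_parallel):
        start = time.perf_counter()
        results = poll(nodes)
        elapsed = time.perf_counter() - start
        print(f"{poll.__name__}: {len(results)} nodes in {elapsed:.2f}s")
```

Because each query spends most of its time waiting on the
network, threads are a reasonable fit here. A production
poller would also need a per-node timeout so that one
unreachable node cannot stall a whole cycle.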


Ideally, we would like to see EverStats integrated into
the base CoDeeN and CoMon projects for use by both
private and public PlanetLabs.


Table 1: Sample EverStats slice group report. Only the
bandwidth columns of one row are recoverable:

Avg. Outgoing Bandwidth (Kbps)    Avg. Incoming Bandwidth (Kbps)
44.70                             46.69


Education


The Evergrow project is composed of researchers in
Computer Science and Computational Physics. All of our
researchers are computer literate and have some
scientific programming ability. At the beginning of the
Evergrow project, we assumed that researchers would use
any and all computational resources that we could
provide. In fact, we found that computational resources
are currently widely available. Desktop workstations
have enough processing power to handle many tasks
previously allocated to dedicated processors. One of our
authors processes gigabyte files on his laptop. The
performance is not great, but the benefits of taking
your work with you outweigh the time to completion.


Our partners provided a list of requirements 
when we started the Evergrow project. Some wanted 
High Performance Computation (HPC) services. Others 
wanted distributed network platforms for experimenta- 
tion. The resulting EverLab system provides both, but 
we found that our partners needed help getting up to 
speed. Our most effective tool has been hands-on work- 
shops. Our first workshop was held in June, 2006. 


PlanetLab (and EverLab) present the world as a set of
virtual servers running on remote hosts. At one time,
using telnet, SSH and X-windows was the standard method
for interacting with remote hosts. Today, undergraduate
and graduate students use the Microsoft Windows platform
and Microsoft Remote Desktop Connection. PlanetLab’s
interfaces are much more basic and are less familiar to
many of our partners.


During our workshop, we walked the participants through
the EverLab process. To get started running their own
code on EverLab (or PlanetLab), a researcher must:

1. Have a registered site and Principal Investigator
(PI). We created a pseudo-site called EverLab for all
of our users.

2. Request an account by filling out a web form.

3. Wait for the account to be enabled by the PI or site
administrator.

4. Create and upload an SSH key to the management web
site.

5. Have the PI create a slice for the project.

6. Assign the users to the new slice.

7. Assign the slice to one or more nodes.

8. Wait until the slice propagates to the target nodes.

9. Log into the slice using SSH.


The total latency from start to finish is at least
about one and a half hours. For a user trying this
remotely, it can take between one and three days just
to be able to log into the nodes. The major benefit of
our workshop was to shorten this initial period and to
get users working during the workshop’s first day. The
second day was spent learning about Ganglia, Condor and
custom deployment scripts that other researchers have
written for deploying experiments on PlanetLab/EverLab.
Summaries of the presentations and tutorials are
available on the web [2].


We have found that all of our active researchers at- 
tended our workshop. It may be that other researchers do 
not need our dedicated resources, or that the learning 
curve is too steep. We plan on continuing our education- 
al efforts and working with our researchers to identify 
the barriers to better system utilization. 


Future Work 


We have identified a number of areas for future work on
PlanetLab in general and on EverLab in particular. Each
of these areas is an extension of our experience with
the current EverLab system and its user community.


Security 


Fedora Core 2 (FC2) was first released in May 2004.
Fedora Core 4 (FC4) was the most recent release as of
September 2005, when we started working on PlanetLab.
As of the summer of 2006, EverLab is still running FC2
on its nodes and FC4 on its central management node
(PLC).


Fedora Core 2 officially reached its end-of-life in
June 2007. Fedora Core 4 reached its end-of-life in
January 2007. Fedora Core 5 was retired in July of
2007. The implication is that bug-fix releases and
security patches for these systems are no longer
available from the Fedora team.


Our experience and that of the PlanetLab Consortium is
that there have been almost no security issues related
to FC2 or FC4. The latest PlanetLab V4.0 release still
supports only FC4. Common wisdom would suggest that we
update to FC7 as soon as possible. Our experience has
been that our deployed version of the three-year-old
FC2 has been stable and secure and that there is little
urgency to upgrade.


Usability 


Our user community differs from the standard PlanetLab
community in its grasp of UNIX tools. The PlanetLab
community includes many systems researchers who
understand the Linux operating system and its user-level
tools in detail. Our community of physicists and
computer science theoreticians does not have this level
of systems knowledge. We would like to see future
systems include a suite of basic, easy-to-use tools for
accessing the nodes, deploying applications, collecting
logs and monitoring an experiment’s activity. In most
cases, the effort to develop these tools is one of
packaging and documentation.


Improved Coordination 


With the release of PlanetLab V4, there has been 
significant improvement in the installation and up- 
grade processes for private PlanetLabs. The major re- 
maining issue is coordination on PlanetLab changes. 
We would like to see a PlanetLab Engineering Task 
Force (PETF) that would manage platform changes 
and coordinate platform security. 


We see PlanetLab as a moving target. There are 
many possible ways to extend and improve the system. 
The challenge is to choose the appropriate changes for 
the private PlanetLab community. Private PlanetLabs 
value stability and security over experimental features. 
The PETF would collect and document these changes. 
It would provide a repository for all blessed changes 
and versions of the system. 


As a production system, PlanetLab should have a
security coordinator. The PETF would track published
and zero-day attacks on PlanetLab or its constituent
components. It would provide timely notification and
updates to administrators concerning these attacks and
would coordinate efforts to detect and correct these
problems as they occur.


The PETF could organize workshops and confer- 
ences for Private PlanetLab administrators and users 
as a way to educate the community and to identify ar- 
eas for improvement and growth. 


Federation of PlanetLabs


One vision presented by the PlanetLab commu- 
nity is to integrate remote PlanetLabs in a federation. 
Users on one system would be able to utilize resources 
on federated systems while abiding by inter-system 
usage policies. This concept requires coordination of 
the detailed federation interfaces as well as the defini- 
tion of appropriate system level policies. 


The EverLab installation has many of the features
required for a federated PlanetLab system. We operate in
a production environment and our system is not
oversubscribed. While our project would welcome
federation with other private PlanetLabs, our partners’
network administrators would be hard-pressed to open the
system to non-partner sites without additional controls
to protect their networks from abuse.


Conclusion 


EverLab serves as a model for future research efforts.
It bridges the gap between Grid-based HPC installations
and free-for-all experimentation systems. EverLab makes
efficient use of administrative resources and provides
reporting services to monitor system usage, reliability
and responsiveness. EverLab provides a service for
setting policy on resources so that all participants
have access to the shared resource. With the exception
of pure HPC projects, we believe that future efforts
should include an EverLab-style system for system
management, resource allocation and monitoring.


ABSTRACT


We report on and discuss our experiences with teaching Network and System Administration
at the level of Masters at Oslo University College and the University of Amsterdam. At our
respective institutions we have independently arrived at very similar models for teaching a
traditionally vocational subject within an academic Computer Science framework by
incorporating a strong practical component.


Introduction 


For several years now academically inclined sys- 
tem administrators have struggled to identify the role 
and place of System Administration within the fields 
of Computer Science and Engineering. This effort has 
often brought controversy, with strong and diverging 
opinions dominating over any consensus. This is not 
uncommon for a novel field of study. Whereas many 
university subjects evolve from academic and voca- 
tional traditions that have existed for generations, the 
establishment of a curriculum in System Administra- 
tion (a hybrid subject somewhere between computing, 
science and engineering) cannot make the progress that
industry and society need by waiting for the results of a
protracted evolutionary process. The need for system 
administrators is here now. 


In our degree programmes we have chosen to 
avoid the controversies and formulate courses based 
on our own experiences as system administrators and 
academics. We believe that professional educators 
have a better chance of settling these controversies 
than those whose passions rage, as we shall explain 
below. By anchoring the subject in established aca- 
demic traditions, but maintaining the hands-on aspect 
of the subject, university milieux have been more re- 
ceptive to the idea of system administration as an aca- 
demic discipline, even when they have not always un- 
derstood the initial vision. What is interesting is that, 
in spite of the lack of standardization in thinking, and 
in spite of superficial practical differences imposed by 
our local environments, the courses we have devel- 
oped in Oslo and Amsterdam overlap strongly in both 
flavour and substance, so perhaps consensus is not so 
far from reality after all. 


Our aim in this paper is to report on our efforts in
this area. We do not present our courses as perfectly
formed, finished products to be admired by all (no
university course ever turns out the way its designers
would like, due to numerous constraints and obstacles);
rather, we comment on the philosophies and
implementations that have led us to make courses at our
universities. In each case, although we each have
some relevant introductory Bachelor level courses, we 
have found that Masters level study programmes are 
most suitable for implementing system administration 
studies, since students benefit from Bachelor level 
skills in more standard computer science as well as 
from a certain breadth of background. We shall not at- 
tempt to view system administration as a profession in 
the sense of apparently similar organized job descrip- 
tions (like the medical profession), as this topic is 
charged with issues that go far beyond education. 
Rather we focus on our experiences as educators and 
point out how we have approached a problem that 
some have claimed was impossible: to turn system ad- 
ministration into a discipline. 


We begin by discussing our approaches to edu- 
cating system administrators, then we consider norms 
and standards in related fields. We summarize the con- 
tent that we consider to be essential and report on our 
experiences with this. Finally we draw some initial 
conclusions about our successes and failures. We hope 
that the strong similarities we have arrived at in our 
own neutral attempts to formulate system administra- 
tion academically will help others to see their own 
subject or profession through more impartial eyes, and 
perhaps even help to advance the state of understand- 
ing of the field. 


Relationship of System Administration to Comput- 
er Science 


Let us begin by asking how system administration
relates to other fields of research that are commonly
associated with it. Computer Science derives
traditionally from two camps: mathematical logic and
electrical engineering. Both of these have long academic
traditions that have influenced the way computing has
been researched, framed and is taught in universities.
In addition it is known that physics graduates are well
represented in some areas of Computer Science. Not
everyone working with computers has learned the subject
in a university. Computing as a phenomenon is young and
many self-taught hobbyists have entered the workplace on
the strength of their own private initiative. Such
people ought not feel threatened or belittled by
academic initiatives.


System administration has not normally been a subject
in its own right, though many short courses have been
offered by universities around the world. However, it
has not been completely absent from curricula. Some work
that has gone into studying “computer management” from
universities and graduates has been in the area of
telecommunications, thanks to often generous sponsorship
of powerful telecom organizations. Much of this work has
been closed-source, however, and centered around
organizations like the Telemanagement Forum (TMF) and the
Internet Engineering Task Force (IETF). It goes back to
the 1970s under the title of “Network Management” and,
in some respects, many of the issues facing system
administration have been discussed and “solved” within
that limited context, e.g., see [1]. Thus, system
administration lags behind its related “big brother”,
Network Management. Many electrical engineers who have
found themselves in the clutches of the information
technology revolution have entered this field through
telecommunications. It is taught in various courses,
especially in Europe (where most of the research in this
field is carried out), and it is represented by major
conferences like NOMS [2] and IM [3] and the IARIA [4]
conferences, organized by major telecommunications and
routing companies like Cisco and Motorola. Since about
2001 we have made a concerted effort to cross-pollinate
these disparate communities.


A second group that has involved itself in man- 
agement concerns is software engineers. Distributed 
software systems and middleware are often used to 
“manage” software layers, instrumenting software 
with inter-communication capabilities that need to be 
managed just like more complete operational environ- 
ments. Although the list of challenges is somewhat 
smaller in software engineering, and this gives software 
engineers an oversimplified impression of the chal- 
lenges of system administration, the overlap makes a 
connection with software engineering. It also contributes 
to the widespread belief amongst computer scientists 
that system administration can be solved through soft- 
ware engineering alone, though this seems to be chang- 
ing as computer systems become more ubiquitous (see 
conferences like DSOM [5] for these communities). 


How Should System Administration Be Taught? 


The need for formal education is rarely disputed but
the form is often controversial, especially amongst
those who learned some of the necessary skills flying by
the seat of their pants. It is common for self-taught
practitioners to reduce subjects to a list of a few
skills that are needed, then “the rest is experience.”
It has even been suggested that it is “too early” to
formalize anything about the field, since it is not yet
well enough understood. However, the job of a skilled
educator is to pragmatically distill such experience
into a literature of material that can be taught, and we
believe that it is never too soon to do this. Sufficient
understanding is a journey rather than a finished
product.


There is no single approach to learning that can 
solve all of a society’s needs. The need for learning 
does not disappear once we are finished with school 
and so there is a need to combine further education 
with work in some appropriate way. As we shall dis- 
cuss below, society is becoming less tolerant of paying 
for schooling, and is increasingly pressuring both en- 
rolled students and would-be students to work along- 
side their further education. 


Education is typically covered by a three-pronged
strategy which involves:
• Self-learning.
• Training.
• College/university studies.






Figure 1: The three aspects of learning.


Although most education in system administra- 
tion is in the first two categories, it was our intention 
to formalize the subject into an academic discipline in 
our Masters degree programmes. This has been a
considerable challenge and has taken the better part of ten
years of preparation in both our groups. 


Our colleges have contributed several important 
text books in this field, one at Bachelor/Master level 
[6] and one at Masters/Ph.D. level [7]. A book for 
bachelor level studies also covers the Norwegian mar- 
ket [8]. 

Self-Learning 


Self-learning is learning without the guidance of a
tutor or course coordinator. There is no assessment and
no interactive diagnosis of a student’s progress.
Self-learning is clearly a necessary part of any
learning scheme: learning without some effort by oneself
would require technology yet to be invented (some would
say that teaching is, in fact, only management, that all
we do is promote an environment which accelerates the
student’s own learning process by providing for,
inspiring and guiding them). Self-learning however
implies that a student does not have access to an
organized programme of study and is therefore lacking in
the potential benefits of others’ experience.


Self-learning is usually the only option available 
to the would-be student at the outset of a new field of 
study. In engineering disciplines particularly it re- 
mains an important aspect of a learning strategy, in the 
guise of “trial and error” practice. Getting one’s 
hands dirty in the field is one of the most important 
confidence building exercises, and it is a fast track to 
connecting experience to meaning. Trial and error is 
an efficient approach to knowledge acquisition be- 
cause one can easily see the patterns of success and 
failure at first hand. Both authors have had their share 
of self-learning experience. 


Self-learning is not the same as experimentation 
or lab work. Many technicians working in laboratories 
are simply carrying out practiced procedures that are
often later replaced by automated machines. There 
does not need to be understanding to complete a task. 
Self-learning implies a development of understanding 
through one’s own effort. The effort might simply in- 
volve reading, but in most cases it requires a person to 
engage their motor functions and do something physi- 
cal (whether the doing is simply writing, solving math 
problems, or plugging cables into boxes). 


We might define a subset of self-learning to be 
self-training, which we understand as the study of 
recipe-solutions to problems from standard materials. 
This is a common form of training for certification pro- 
grammes. Self-learning can therefore span both educa- 
tion and training, in the following sense: significant 
reading can lead to a broad perspective on problems if 
the reading is broad enough and the self-taught student 
has the opportunity to put the reading into context.


Training

Training is a form of targeted knowledge presen- 
tation often aimed at teaching skills or giving summa- 
ry overviews. It can never be a substitute for a com- 
plete education; it is mainly useful for filling in gaps 
in a basic knowledge base, such as learning a particu- 
lar procedure or using a new tool. Training is not a re- 
alistic strategy for teaching complete and ‘rounded’ 
professionals because there is inevitably a need for 
long-term integration of knowledge into one’s existing 
cultural base. 

For people in full-time employment, training is often
the only available way to interact with a teacher. Short
courses given at conferences or by other providers offer
an access route to organized classes when students have
time limitations, such as during full-time employment.
Training courses are usually based on a set of slides
and are presented at a low level for a wide audience,
since they are financed by their commercial success.
This leads to a side-effect of training, which is an
unintentional “dumbing-down” of the material in order to
reach the largest paying denominator.

Some companies offer training programmes to graduates
because they see gaps in their knowledge in specific
skill areas. A graduate already has a basis of deep
knowledge into which a training course can be
integrated. A notable training programme that exceeds
most is the Cisco Academy/university training programme.


The traditional form of training is that of half- 
day to one-day classes at conferences such as LISA, 
USENIX, NOMS [2], IM [3], etc. These are slide 
shows with commentary. Training tends to be supplied 
as a list of recipes and do’s and don’ts. There is little 
time to develop any significant understanding. If an 
idea does not resonate with the trainee more or less 
immediately, he or she has little recourse other than 
recommended texts to fill in the gaps. 


Training can be a “leg up” to help a motivated 
employee to self-learn, for instance. This motivational
aspect of teaching should not be underestimated. 
Training at conferences forms part of a (hopefully) 
positive experience that can have a powerful motiva- 
tional effect. It is also sometimes accompanied by cer- 
tification to capture part of the benefits of verification. 
Certification involves some kind of test, normally a 
multiple choice psychometric evaluation. Verification
and testing can be employed in principle to ensure that 
trainees reach a standard interpretation of the subject 
of training, rather than their own (perhaps distorted) 
version. 


College Studies 


Accredited college or university education is the 
oldest kind of higher education, the most extensive 
and the only kind of learning that addresses all of the 
strategies for learning at the same time. A planned ed- 
ucational programme in a college or university has the 
potential to incorporate elements of both of the fore- 
going approaches within an extended curriculum. The 
ability to test students repeatedly serves also the pur- 
pose of measuring their abilities and motivating them 
to improve their performance. 


A college education can build up concepts and 
principles over an extended period of time, and place 
them in a larger context. This is an important cultural 
experience. Humans learn essentially by story-telling, 
in which they must form their own version of a story, 
and place themselves within it, before they are willing 
to believe in it fully. An extended programme can ex- 
plore with students the reasons behind what they are 
learning and build their own personal motivation. In a 
college programme, students have time for trial and 
error and they have the opportunity to put knowledge 
into a context. 


College education is more interactive than either 
training or self-learning. There is a greater opportunity 
for feedback, building confidence and quickly
correcting misconceptions. Moreover, the experience has
an epidemic effect: one conversation with a student can
improve the teacher’s understanding and spread to other
students, and so on in a viral way.


Another phenomenon introduced by college edu- 
cation is the concept of “group self-learning,” where 
students study in the alternating role of teacher and 
trainee. This so-called “power learning” is a useful
extension to the materials and wisdom brought in by 
the official teacher. Both our institutions have prac- 
ticed this with success. 


Pre-requisites and the Culture of Learning 


So what ought students know before and after their 
education? As university lecturers, we affirm that the is- 
sue of what students ought to know in advance of a 
course programme is highly politically charged. Educa- 
tion is a cultural phenomenon and priorities are con- 
stantly changing (not always for the better). Colleges 
and universities inherit students from other schools and 
workplaces, and cannot guarantee that applicants will 
reach the minimum bar expected for starting. 


One difference between our study programmes 
lies in the attention to pre-requisites. At Amsterdam, far 
more work is put into selecting strong applicants and 
deterring weaker candidates than at Oslo, because the 
curriculum is taught in half the time and under greater 
pressure. An additional year for personal growth allows 
the course at Oslo to pull through students who might 
not make it in the pressure-cooker model used at Ams- 
terdam. 

Both institutions require basic programming, know- 
ledge of operating systems principles and some Bachelor 
level mathematics. No institution can ever teach every- 
thing someone might need to know about a subject and 
thus colleges and universities do not try to do so. They 
do not usually give “training” in specific skills except as 
an example to a more general discussion. Rather, the ap- 
proach is to charge students with the more fundamental 
skills needed to learn for themselves, along with a critical 
eye so as to not take everything at face value. Education- 
al institutions have developed strategies for teaching 
these general skills over centuries. These methods have 
become cultural norms, and there are many reasons why 
we should respect them as norms, even if they do not 
train students to specifically use tool X or carry out pro- 
cedure Y. 


Teaching educational norms is important especial- 
ly because it results in graduates who can communicate 
in a cross-disciplinary field and society at large. Exam- 
ples of teaching norms include basic science and lan- 
guage skills (reading, comprehension, reporting, cre- 
ative writing, summary etc.). Mathematics and physics 
are taught, for example, not usually to allow people to 
calculate planetary orbits, but because the skills one
must master to complete these topics have general
validity. Moreover, since most other people have been 
through the same experience, it becomes shared knowledge
and this allows any student to communicate with
any other person who has learned the same set of standard
concepts. (“Remember how we used to calculate
the efficiency in physics? We could do the same thing
here...”) This would not be possible if the training were
too specific and too directed. The ability to write clearly
and to reason about problems is something else students
of mathematics and physics learn. Writing and
language skills have many of the same features as
mathematics: grammatical structure, attention to detail,
the need to interpret the meaning from a potentially
ambiguous signal, and so on.


This cultural aspect of education is under pres- 
sure today from impatient and under-educated spokes- 
persons in our societies who would dumb it down. The 
urgent rush to specialize and “train” people for “the
jobs industry needs today” eventually becomes harm- 
ful to the culture of learning in society. If we bypass 
well-known metaphors in favour of new short-cuts, 
sometimes using computer software to remove the 
need even for a basic skill to be learned, then we im- 
poverish the experience. We need these cultural norms 
in education, like math and science, both because they 
allow us to speak a lingua franca of reason that cross- 
es discipline boundaries, and also because they allow 
us to re-use the experiences and hard-won understand- 
ings of other fields. 


We believe it is important to understand the limi- 
tations of basic training: a trainee can repeatedly com- 
plete a task without any understanding whatsoever. It 
is only when the train runs off its tracks somewhere 
that one discovers that the uneducated student has no 
idea where he or she is in the landscape of knowledge. 


This is exactly where the need for academic level
education comes in: not only to lay out the
tracks and keep the train on them (skills that can more
or less be trained by the shortcut approach mentioned
above), but also to cope with derailment or with
entering unknown territory.


Modeling 


The philosophy of science gives us one of the 
most essential tools in the problem solver’s toolkit: the 
idea of modeling. As the philosopher of science David 
Hume maintained, there are two kinds of knowledge 
that should be clearly distinguished. 

• Theoretical knowledge that can be determined
precisely and exactly (proven perhaps), but
whose relationship to the real world is uncertain.

• Empirical knowledge that is certainly about the
real world, but whose measurement and interpretation
are not exactly determined.


In both cases our understanding of the world is 
imperfect. Our basic understanding of phenomena in- 
volves building approximate mental models, so a 
deeper understanding of this process can only help 
students towards a deeper understanding of the phe- 
nomena also. 

A model is composed of suitably idealized approximations
that attempt to manage this uncertainty
and allow one to:

• Make predictions.

• Calculate answers.

• Classify and hence understand phenomena.

These are important parts of rational thought.


However, we are also faced with problems in this 
process. The usefulness of standardizing knowledge, 
so as to make it more authoritative, comes at the cost 
of whitewashing over these uncertainties. Science it- 
self has to some extent become standardized and even 
commercialized in curricula today, so that it is often 
taught misleadingly. Students come to regard science
with an almost mystical reverence as they, like many
in society, harbour the erroneous belief that science
teaches “truth” rather than approximate models of
practical value.


There is a lesson here: shrink-wrapped packaging
of education which violates the questioning and
critical spirit of scientific discovery can breed ignorance
as quickly as it can teach skills. Questioning and critical
thinking are important skills in system administration. It is vital that
students be shown how to ask fundamental questions 
and be able to question assumed truths. Nevertheless 
there is a role for standardization in the language used 
to describe a subject. A common ontology is as impor- 
tant as shared cultural values in enabling students to 
communicate without talking at cross-purposes. 


Standard Curricula for Computer Science 


The Ironman Curriculum, started by the
Association for Computing Machinery, is an effort to
standardize the framework of topics in a consistent
taxonomy. This has been a long term effort. The complete
Ironman report includes a number of documents,
e.g., [9]. These are far too numerous to discuss here.
They can all be found at the ACM website.¹ System
administration was initially absent from the Ironman 
documents, which have been growing since the late 
1990s and were finalized only in 2005. Several terms 
have now entered into the curriculum. 


It is interesting to see how terms are integrated 
into a traditional computer science framework. While 
some system administration activists would apparently 
prefer to define system administration as an entirely 
separate enterprise, here we see that topics have been 
slotted into the existing taxonomy of categories rather 
than defining it as a tumour to be tacked on to the 
edges of computer science. This partly reflects the ap- 
proach to the subject taken in Oslo, where some com- 
promise on words and terms has been made to inte- 
grate the knowledge into the larger picture of program- 
ming, Unified Modeling Language design, database 
searches, service orientation etc. However, there are 
topics in system administration that do not traditionally 
appear anywhere else in computer science: fault management,
reliability, policy etc. These are also existing
disciplines that overlap system administration with oth- 
er fields, possibly in different university faculties. 


On closer examination the course material in our 
Masters programmes fits (surprisingly) comfortably 
into the Ironman standard curriculum as long as one 
reads it with appropriate glasses. This means that even 
colleges without an explicit degree in system admini- 
stration could put together a helpful syllabus that 
would be relevant to the field, with the help of some 
massaging of language and examples. 


From Principles to Content 


We turn now to the content of our degree courses.
What are the key areas that constitute an education
in system administration? Previously the System
Administrators Guild (SAGE), with the special assistance
of Rob Kolstad, proposed to build a taxonomy
for system administration. Other possibilities for
mapping knowledge have been proposed recently with 
the growth of interest in semantic webs: the concept of 
ontology is presently quite popular [10]. An ontology 
goes beyond this with an extended list of terms, usual- 
ly belonging to a single and uniform cultural body. So 
far however, this has not formed a useful basis for an 
educational map where common concepts bind topics 
together rationally. One reason for this could be that 
taxonomies are inherently subjective and such subjec- 
tive impressions and focal points change very quickly 
in the technological disciplines and create more con- 
troversy than consistency. 


At Oslo University College, our approach has
been to search for a stable core that can be used to 
teach understandable models and principles, and then 
to pepper this core with contemporary examples and 
practice, because understanding is based on models. 
The course structure was built from the Bachelor level 
up by identifying common principles from a mass of 
empirical writing and practice [6]. However, in spite 
of having courses at Bachelor and Masters level, there 
are topics that we cannot cover sufficiently for every- 
one’s taste. For example, there is probably sufficient 
material to give an entire course on “storage” (quite 
desirable in present times), but this would only mean
less time for something more fundamental and the
time might simply be wasted when next year the
technologies were different.


We find that, in spite of good intentions, we have 
a particular difficulty giving students a realistic insight 
into practice. Some basic skills about practice can be 
learned through laboratory work and research, other 
skills must be learned implicitly by reading and writ- 
ing. We are forced to make a judgment about the best 
use of a student’s time for the long run. 


An obvious example where an ontology might
help to rationalize our approach is in finding a common
set of concepts to be used by UNIX administrators and
Windows administrators. Similarly, the chasm of terminology
between network management and system administration
is so vast that most telecom network management
people do not even realise that system administration
is a task.


It would be an easy option to give courses in net- 
work management, if the idea was simply to fill up a 
curriculum to attract students. Europe and Asia have 
dominated the research and commercial activity in 
Network Management for many years, while the United
States has been the greater champion of system administration.
Network management has been dominated
by the input of software developers using data
models to “manage” (usually only to record in a database)
information about network devices. The subject
is highly protocol oriented and leans towards layered 
models of centralized management. System admini- 
stration has been more about UNIX and its highly at- 
tractive open environment — giving great freedom to 
individualists, but consequently lacking focus. An on- 
tological study trying to place knowledge into one cul- 
tural framework in such a way that it can be mapped 
into another, even inexactly, could help to bring these 
two fields together more quickly. 


In putting together courses at our two universi- 
ties, we have purposely not been led down the paths of 
least resistance. Rather we have tried to supplement 
the somewhat bureaucratic views of network manage- 
ment with a more engineering viewpoint. We do not 
believe that “management” should be equated with 
monitoring of devices, nor with change management 
models or database modeling. Instead we have looked 
for a constructive approach to systems: how to build 
and maintain them, with a critical eye. 


In our chosen course profiles we have effectively 
proposed our own poor-man’s ontologies of most 
meaningful areas, as many words are inconsistently 
used, but they are coloured by our own particular 
tastes and specialties. We shall present some details of 
these below. As readers will see, the basic ideas cho- 
sen by both institutions have emerged along quite sim- 
ilar lines. 


Two Year Oslo Programme 


In our initial plans for an international two year 
Masters at Oslo, we expected to be able to require a 
number of courses in basic system administration re- 
lated skills, possibly by different names, to the level of 
our own course System Administration #1. We also
hoped to request some basic computer security.
However this wish list soon proved impossible to 
achieve. With the exception of our own students, there 
were practically no other student applicants with this 
kind of background. In security especially, we could 
see that most colleges offering courses in “security” 
were really teaching applied encryption, not the kind 
of rounded security management that we expected. 
Consequently we were forced to lower our expectations
and admit students with any kind of Bachelor in
Computer Science, possessing basic university level
maths (discrete math, calculus and matrices), and to develop
a common framework. We altered our priorities
to teach students their missing skills during the Masters
programme. Several approaches have since been tried,
including intensive catch-up work in the lab; however, 
we have ended up by offering the missing courses in 
their entirety as optional modules, since one cannot di- 
gest the concepts in a few lab exercises. 


The need for a programme that straddles the dual 
requirements of an academic degree and a strong prac- 
tical component is undisputed. The scientific tradition 
in network and system administration is rather weak, 
and it has been one of the goals of the Oslo research 
group to strengthen this. As one of the few institutions 
carrying out scientific research in this field, we are in 
a unique position to be able to feed the results of cur- 
rent research into teaching. This happens continuously 
as new developments and technologies emerge. 


One possible approach to curriculum develop- 
ment is to take a mercenary approach and give people 
what they want. When asked what skills graduates 
should have, employers are quite unclear however. 
Some employers would like graduates to be ready- 
trained to begin work installing a Storage Area Net- 
work. Others want graduates who can “think for 
themselves” and see the “big picture.” It was left to 
the college to decide the curriculum. 


We recognized that we cannot teach our gradu- 
ates all of the skills required to be a successful system 
administrator. What we hope to achieve is for them to 
think clearly, constructively and critically about prob- 
lems, to learn for themselves and be independent 
thinkers. We have consciously avoided naming courses 
with specific technologies (e.g., LDAP or APACHE) 
or skills, except perhaps in the case of the Supercom- 
puting course, which is run by a third party. 


Skill-training is probably the weakest part of the 
courses in Oslo. Students are expected to gain practi- 
cal skills as a by-product of their work in the laborato- 
ry. Some but not all students achieve this. The result is 
that good students learn both practical and analytical 
skills, while other students tend to excel in only one or 
the other. On the other hand, we receive clear feed- 
back from students that the subjects that “change their 
lives” the most are the more scientifically oriented 
courses, such as the basic laboratory training and Ana- 
lytical Methods courses, as many of these basic scien- 
tific ideas have never been explained to them before as 
something they could use. 


Course degree programmes are naturally hostage 
to the general trends in world education. These vary 
from country to country and we see all variations in 
our international programme. We must cater to quite 
different attitudes to learning and skill sets. 

Goals


A brief summary of course goals at Oslo follows: 

1. Graduates should have an insight into the most
important technological developments and sci- 
entific results in the field. 

2. Graduates should have the ability to apply their 
knowledge and insight in this field. 

3. Graduates should be able to think in the abstract 
about systems, using the idea of models, general- 
izations and approximations. They should be fa- 
miliar with traditional terms, philosophies and 
concepts of modern scientific thinking. 

4. Graduates should know how to search the exist- 
ing literature of the subject. 

5. Graduates should have the capacity to commu- 
nicate clearly both orally and in writing. They 
should be skilled in giving clear and compre- 
hensible presentations and be capable of ex- 
plaining their ideas at the appropriate technical 
level. 

6. Graduates should be aware of societal and ethi- 
cal aspects of the use of computer technology 
and its management. They should be able to 
make ethical judgments and argue for these. 
They should recognize the difference between 
an opinion argument and a rational scientific 
judgment. 

These goals can be achieved in a number of ways. We 
divide the qualities amongst broader subject areas. 


Research Based Teaching 


Our focus on a strong sense of scientific values 
has been driven by the desire for our research into 
computer systems to be of the highest scientific standard
(a standard in which Oslo University College
leads). Our participation in a European Network of
Excellence (EMANICS) also feeds directly into our
programme and has motivated several modernisations
of terminology and minor changes of focus in the topics
we cover. The fact that we focus on general principles,
illuminated by examples, allows us to rapidly
incorporate changes in current technologies and ways of
thinking without completely changing the curriculum. 


Our course in High Volume Services, on the other
hand, was an area in which we responded directly to
a need from industry for system administrators with
competency in large scale data center deployment.
Such a course did not exist anywhere in the world to
our knowledge and thus we instigated a number of research
projects which formed the basis of this course
[11, 12, 13]. See the appendix Course Descriptions from
Oslo for course descriptions.


Text Books 


The lack of textbooks was initially a problem, 
but three books form the core of the principles of the 
course. The first book newcomers need is Mark 
Burgess’ Principles of Network and System Admini- 
stration [6]. Without this book, students who have 
never worked as system administrators have no idea
what the subject is about. Later, The Practice of System
and Network Administration by Limoncelli and Hogan
[14] provides excellent advice about experiential and 
management matters, but remains difficult for students to 
understand without experience of working as a system 
administrator. It is recommended as supporting literature 
as it is eminently readable. Finally, we use Analytical 
Network and System Administration along with exten- 
sive exercises as the basis for teaching scientific method. 
This book is too difficult in its presentation however and 
only excerpts are used in practice. In addition to these 
core books, we provide students with a library of many 
technical books on special topics. 


Finally, to bring system administration up to the 
state of the art, we have collaborated in the publication 
of a new collection of essays by experts in the field of 
system administration called the Handbook of Net- 
work and System Administration [10]. 


Examination Forms 


We require students to read, write, speak and “do” 
well during their sojourn at the College. The ability to 
both think and communicate those thoughts is central to 
our ethos. We test students on their skills in 

• Impartial reporting of procedure and results
(scientific method).

• Opinion or standpoint formulation (decision-making).

Our preferred examination form is the oral exam- 
ination combined with course work. This is a very 
cost-effective approach to gauging students’ under- 
standing, as long as student numbers are not too great. 
By teaching at Masters level, we can make it economically
viable to keep a smaller number of students,
supported by research activity.


Even students who are initially weak in English 
language end up being able to make reasonable pre- 
sentations and have no serious problems in mastering 
this form of examination, thanks to a consistent em- 
phasis on communication skills throughout the course. 


Relationships Between Courses 


The course programme was designed with particular
care (see Figure 1) to teach concepts and theory and
to combine these with practical experience. Since some
concepts require skills that computer scientists are
particularly poor in (e.g., calculus, statistics and empiricism)
it is especially important to introduce these concepts 
repeatedly over time. The programme is composed of 
four semesters which have principal goals as follows: 

1. To provide introductory or fundamental concepts,
knowledge and skills.

2. To make students independent in their learning,
with experience of self-learning by trial and error etc.

3. To teach students how to view the world analyti- 
cally using models and experimentation fol- 
lowed by interpretation. Students are encouraged 
to develop the capacity for original thought. 

4. All of the above are implemented in an individual
project.


Although Masters degree projects are increasingly 
being allowed as group efforts in colleges and universi- 
ties, we strongly encourage students to undertake indi- 
vidual projects. In certain cases projects have been relat- 
ed, but then each student writes an independent thesis. 


The figure shows how courses follow on from one 
another and build knowledge and experience that is 
reused in later courses. The ‘most important’ courses are 
those which collate, integrate or serve knowledge to and 
from the largest number of other courses. These include 
the laboratory course and the analytical methods course. 
Naturally the final thesis itself essentially builds on all 
of the foregoing courses. However, this depends on the 
exact nature of the project chosen by the student. 
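
The notion of a 'most important' course used above (one that links to or from the largest number of other courses) can be illustrated with a simple count of connections. The following is only a minimal sketch: the course names and the links between them are hypothetical, chosen for illustration, and are not taken from the actual figure.

```python
from collections import defaultdict

# Hypothetical links: (source course, course that builds on it).
# These names and edges are illustrative assumptions only; they do
# not reproduce the real structure of the Oslo programme.
EDGES = [
    ("System Administration", "Laboratory"),
    ("Laboratory", "Analytical Methods"),
    ("Laboratory", "High Volume Services"),
    ("Laboratory", "Thesis"),
    ("Analytical Methods", "High Volume Services"),
    ("Analytical Methods", "Thesis"),
]


def connectivity(edges):
    """Count, for each course, the links leading into and out of it."""
    degree = defaultdict(int)
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return dict(degree)


def most_central(edges, top=2):
    """Return the `top` courses with the most links, ties broken by name."""
    degree = connectivity(edges)
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


if __name__ == "__main__":
    for course, links in most_central(EDGES):
        print(f"{course}: {links} links")
```

With these assumed links, the laboratory and analytical methods courses come out on top, which is the kind of result the figure is meant to convey. A different set of links would naturally give a different ranking.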


In addition to actual subject matter, there are sev- 
eral cultural delineations of the same material by dif- 
ferent industrial and academic subgroups: 

• Network Operations community perspective.

• The telecom industry perspective.

• The UNIX community perspective.

• The Microsoft Windows or Apple MacOS perspective.


Although we try to cover all of these viewpoints, 
with emphasis depending on context, we are notably 
stronger in some areas than others. In particular our 
key strengths at OUC are in UNIX. We have found 
that it is useful to maintain this emphasis in our cour- 
ses since it makes our curriculum unique, and fills a 
niche in education that is missing from curricula of 
other institutions, where Microsoft systems are often 
covered exclusively. 


One Year Amsterdam Programme 


The one-year master’s degree programme in System
and Network Engineering (SNE) is a successful
new study programme provided by the Universiteit
van Amsterdam (UvA). It is offered in collaboration 
with the Hogeschool van Amsterdam (HvA). In 2003, 
an enthusiastic team of experienced system and net- 
work administrators with academic backgrounds initi- 
ated the process of designing and implementing this 
degree programme. 


Almost immediately, it became apparent that the 
programme served a previously unmet societal func- 
tion: the provision of academically trained system and 
network administrators. In the past years, it has also 
become obvious that students who have completed the 
programme successfully have found positions in a 
wide variety of areas. Nearly without exception, the 
employers concerned have commented on the valuable 
knowledge and skills that these students have brought 
to their work. 


This success is partially the result of the pro- 
gramme’s design, organisation and philosophy, with spe- 
cial considerations related to the short one-year track. 

The following are among the valuable features of 
the degree programme: 

• A strong connection to practice with an approach
that is simultaneously theoretical, product-independent
and supplier-independent.

• A solid, clearly outlined and well-coordinated
programme.

• Strong social cohesion resulting from the large
amount of time that students spend together in
the SNE Lab.

• Strong motivation and an excellent attitude toward
study on the part of the students, partially
due to a strict admission procedure.

• Emphasis on concepts (knowledge/insight), with
less emphasis on operational procedures.

• A unique design for the final phase of the programme,
which is divided into two short (one-month)
research projects.

• Considerable emphasis on the preparation of reports
and presentations.

• Emphasis on current relevant themes, including
open technology and security.

• Strong involvement on the part of (core) lecturers.

• Continuous evaluation and adjustment.

• Strong connection to research, as conducted by
the SNE group of the UvA.


Figure 2: Dotted courses are fundamental sources of information and skills. Heavy lines show the most ‘central’
courses, i.e., those with the greatest connectivity or impact on collating knowledge. The four levels correspond
approximately to the four semesters.

At the outset of the programme the name System 
and Network Administration was chosen. Experiences 
with all of the classes that have thus far completed the 
programme and with the employers who have hired 
the graduates have shown that, in practice, the concept 
of system administration is not usually associated with 
academic training. Therefore the name of the pro- 
gramme (but not its contents) has been changed to 
System and Network Engineering. Engineering is a 
better pars pro toto, because the programme places 
considerable emphasis on the architecture of systems 
and networks, including the engineering aspects, be- 
sides administration and management. 


For more information about the structure and or- 
ganisation of the Amsterdam SNE master we refer to 
the self-assessment made for the accreditation com- 
mittee visiting in March 2007 [15]. 


Goals 


The Amsterdam education has the same general 
goals as the Oslo master as detailed in Goals. Some 
more specific qualifications can be specified as fol- 
lows: 

1. Graduates should be skilled in exploring (search- 
ing, reading and evaluating) the many forms of 
documentation and literature concerning system 
and network engineering, with regard to both 
content and medium. They should be familiar 
with the ISOC, the W3C, IEEE and other inter- 
national bodies that develop standards and pub- 
lish in the area of computer systems and net- 
works. 

2. Graduates should be very familiar with the usu- 
al configurations and procedures for the normal 
and crisis administration of a variety of current 
systems and networks, middleware and applica- 
tions. They should therefore be quickly em- 
ployable in the usual multi-vendor systems and 
network contexts. 

3. Graduates should be very familiar with the se- 
curity functions of systems and networks, and 
they should be capable of contributing actively 
to the architecture and configuration of systems 
and networks that conform to current security 
standards. Graduates should also be able to de- 
termine whether systems or networks conform 
to particular security standards. 

4. Graduates should have the technical knowledge 
of communication protocols, network compo- 
nents and business systems that they will need to 
accurately justify choices and steps relating to 
administration and security, including those regarding
configuration, procedures and security
architecture. 

5. Graduates should have sufficient insight into 
the organisational contexts within which sys- 
tems and networks function to channel the 
needs of organisations and users, and to trans- 
late them into appropriate technical support. 

6. Graduates should have sufficient technical know- 
ledge and intellectual capacity to assume posi- 
tions of leadership in the field of system and net- 
work engineering within a few years. They 
should have the capacity to develop their own 
vision of the field of system and network engi- 
neering, thus contributing to evolution and in- 
novation in concrete system environments. 


Relationship Between Courses 


In an effort to realise a coherent programme, two 
main topics were formulated to serve as a unifying 
theme for the entire study programme. These topics 
relate to current and important themes within the pro- 
fession. 

• Open Technology, which involves open standards
(e.g., RFCs), open software (including
open source) and open security (the antithesis
of security through obscurity).

• Security, both technical and non-technical.

With respect to the educational design, the degree programme
is based upon a set curriculum, which is the
same for all students. The courses are offered according
to a set schedule with many contact hours and an
attendance requirement. The programme includes four
technical-theoretical courses (CIA,² SSN, INR and
LIA), one non-technical theoretical course (ICP), one
‘basics’ course (ESA), two practical courses (DIA,
IDS) and two research projects (RP1, RP2). The interdependence
of these courses is depicted in Figure 3.





Figure 3: Interdependence of SNE courses. 


This approach, according to which students follow
largely the same programme, strengthens the coherence
of the programme and the commitment of the
students to the study programme. It encourages students
to attend and participate in the study components,
and it has a positive effect on academic achievement.

² For an explanation of the course abbreviations, see appendix
Course Descriptions from Amsterdam.


The programme is continuously evaluated through 
discussions with students and through regular meetings 
by the lecturers, thereby improving the coherence of the 
various components. 


The curriculum involves a gradually increasing 
level of difficulty and during the course of the year 
students work more and more independently. This ef- 
fect can be clearly observed in the research projects, 
within which the quality of the work that is submitted 
at the end of the year is clearly superior to that of sim- 
ilar work that was submitted halfway through the pro- 
gramme. 


Programme Comparison 


It is instructive to compare the approaches of the two
programmes. Such a comparison was carried out for the
university accreditation committee for the SNE master’s
programme at the Universiteit van Amsterdam. We
reproduce their table here (see Table 1), showing the
correspondence. It is interesting to see that both
programmes emphasize the same basic ideas, even though
the mode of implementation is somewhat different.


Course Implementations 
Oslo 


The implementation of any new programme 
presents a number of challenges both in terms of re- 
sources and imagination. We have now seen the 
progress of four groups of students at our college and 
are able to draw a number of experiences from this 
time. The opportunity to craft the entire curriculum 
without interference from national academic title regu- 
lation has also made the course curriculum highly co- 
hesive and integrated in comparison to the often poor- 
ly cohesive topics at Bachelor level. Students notice 
this difference and have commented on it on several
occasions in evaluation meetings. 


The strong international nature of our group, 
both in the staff and the composition of the students, 
makes an attractive milieu for our students. We have 
used teaching staff from the UK, Norway, Somalia, 
USA, and Switzerland including our fixed members. 
Our students come from all over the world, including 
Asia and the Far East, Africa, America, and Europe. 
Our English speaking environment works for the most 
part well, and the support from the admissions office 
has been excellent. Our active programme of Interna- 
tionalization has brought visitors from America, The 
Netherlands, and a number of countries participating in 
our European Union Network of Excellence EMAN- 
ICS. 


Our college is especially strong in the area of 
UNIX and GNU/Linux technology. This gives our stu- 
dents a clear advantage in service management as 
many server room installations are still based on 
UNIX technology. We have chosen to maintain this 
focus rather than dilute it with ‘more common’ Mi- 
crosoft technology because it gives the students an ad- 
vantage. They can learn the Microsoft technologies 
themselves based on their UNIX knowledge. We aim
rather to show how to integrate these different tech- 
nologies. 


A weakness of our course initially was that it did
not cater well to students who lacked the assumed
prerequisite experience. Students with that ideal
background proved nearly impossible to recruit,
however, and so we have adapted the initial part of
the course to teach the essential background skills.
The students we typically attract:

• Are looking for a way to work with computers
  that is not about programming.
• Rarely have any background in UNIX.
• Increasingly have not worked as system
  administrators either, but are attracted by the
  term “Network” as made popular by the spread of
  the Internet and broadband.


One of the major strengths of our programme is a 
strong course in applied scientific modeling, not merely 
philosophy of science, but analysis, understanding and 
application of mathematical representations. All stu- 
dents are sceptical of the analytical, mathematical and 
scientific content we expect in the course, but many if 
not most students later claim that it ends up as being 
one of the most valuable courses to them, teaching 
them how to think about problems in a way that they 
had never learned before during their university careers. 


In a recent yearly meeting of the steering coun- 
cil, a student representative commented that he was 
learning more in our programme than he ever had be- 
fore. When asked why, he replied: “Because we have 
to.” The students regard our courses as demanding,
and they see this as a positive quality. 


A weakness we have identified in our course is 
the lack of realistic experience in the management of 
computer systems by the students. This is something 
that we have striven to simulate but without much suc- 
cess. Limitations in laboratory space and user activity 
have proven to be insurmountable problems for us. 
One way around this problem is to use internships 
where students spend their summer working in a local 
company. The practice of internships has not been 
widespread in Norway however, as the Norwegian 
summer holiday culture usually means that almost no 
one is left in companies in the summer months to pro- 
vide guidance for students, and hence none are taken 
on. However, most recently we have made some 
progress in this area too as Norwegian society adapts. 


Amsterdam 


As mentioned in the section on the relationship between
courses, the master programme is a coherent effort
centered around the themes of Open Technology and
Security. “System” topics and “Network” topics are
considered to be of equal importance and are treated as
a unified duality throughout the year. The programme is
completely fixed, apart from the obvious choice of
subject for the research projects, and courses are not
accessible to students outside the programme. This
didactic concept creates a very tight social
relationship between all students and is the basis for
a kind of melting pot in which teachers and students
together get the maximum possible result from the
efforts put into the programme. It is the firm belief
of the educators in Amsterdam that such a concept is
indeed necessary for students entering from a higher
professional education level to graduate within only
one year at a sufficiently academic level. The didactic
concept alone is not sufficient to succeed. Students
also need to belong to the top layer of higher
professional education and bring along excellent
motivation to be able to cope with the demanding year,
in terms of both the time and the energy to be invested.


A major contribution to the melting pot is the so- 
called SNE Lab. This is a room suited for multiple 
purposes, accommodating about 20 students with a 
lecture room for centralised teaching, a desktop envi- 
ronment for each student, a dedicated network for pro- 
duction and experiments, a dedicated VM-based serv- 
er for each student for projects and experiments and 
some space for social activities. A major part of the to- 
tal study time is spent in the SNE Lab, also contribut- 
ing to better graduation rates and opportunities to help 
each other out in case of problems. 


As is the case in Oslo, students do not always have 
the prerequisite experience. Unconditional acceptance 
into the master’s degree programme in SNE is possible 
for students that have successfully completed one of the 
university bachelor’s degree programmes in Information 
Science. Admission into our vocationally oriented mas- 
ter’s degree programme is also possible for students that 
have successfully completed a programme at an institu- 
tion of higher professional education (HBO), on condi- 
tion that they pass an extensive intake examination or 
assessment. In practice, the majority of applications ap- 
pear to come from students who have completed HBO 
degree programmes. The following components are part 
of the intake procedure: 

General Skills
• Literature skills: reading and summarising a
  technical document.
• Oral skills: presenting a previously written thesis.
• Analytical skills: basic knowledge of discrete
  mathematics.

Specific Skills
• Basic knowledge of UNIX.
• Basic network knowledge (TCP/IP).
• Basic knowledge of scripting (shell).


Within the actual programme, the ESA course 
offers students the opportunity to brush up on the 
knowledge and skills that they will need to be able to 
follow the rest of the programme successfully. 


The master’s programme does not conclude with a regular
thesis project; the one-year format demands a different
construction. Most students have already written a
thesis for their bachelor’s degree. The competence added
by the SNE master is the ability to research an academic
topic in a very short time (one month) and write a
consultancy report about it. All learned skills and
knowledge come together within these projects as far as
researching, communicating, writing and presenting are
concerned. A disadvantage of this setup is that the
scope of the research has to be restricted to fit within
the one-month period. An advantage is that producing
short, timely reports that give insight into a problem
is of great practical value in real-life situations.
Moreover, this format creates room to exercise these
skills twice a year (in January and in June).


Students often remark that they learn more in a short
time in our programme than in years of earlier
education. This is almost certainly due to the didactic
concept, and it also shows that students are able to
perform better under the right circumstances.


Our education is in Dutch, because we do not 
aim at recruiting international students right now. A 
larger group than 20 students would also put stress on 
the didactic concept. All teaching material however is 
in English and a thorough understanding of English in 
listening, reading and writing is necessary. Most study 
material is presented in the form of slides, accompany- 
ing the lectures, or is available online on the Internet. 
For some courses we make use of books of renowned 
authors in the subject field like [16, 17, 18]. 


Employment of the Students 
Oslo 


Our former students have worked in a variety of
locations. A few of these are listed below.

• Norwegian Cancer Research Fund
• Opera Software
• IBM Norway
• The Armed Forces Computer Facility
• Oslo university Computer Operations Center
• Telia
• Norsk Hydro
• Oslo Data Center (ODS)
• Our group (as system administrator and teacher)
• Freecode
• Linpro AS

Amsterdam


Our alumni have very good prospects as much
sought-after professionals in a variety of positions.
To name a few:

• Continued research in a Ph.D. position
• Consultancy in small and large advising firms
• Lecturer in higher education
• Network Operations with ISPs and big
  infrastructure providers
• ICT management in small and large companies
• System architect and engineer
• Security specialist and auditor


Summary 


This paper is a statement of impressions. It is 
clearly too early to make any scientific evaluation of 
the success of our experiments. Nonetheless we are 
confident enough in our results to report on our activi- 
ties and encourage others to follow suit. The empirical 
quality of our reporting will improve as more years of 
experience can be collated. With four years of
experience we can only assert that, subject to minor
adjustments, our programmes have succeeded well in
their general goals, but that there is still room for
improvement in the details.

We struggle occasionally with resources, but we 
have identified a strategy for rationalizing teaching 
burdens by expanding the curriculum marginally. 

We continue to monitor and discuss the results of
our programmes between our universities, making ad- 
justments on the fly. We believe that these programmes 
have been a success not only in terms of numbers of 
students emerging from the conveyer belt, but also in 
generating an academic cohesion within our groups, 
stimulating research, and attracting doctoral students, 
post-docs and international visitors who have enriched 
our computer science milieux for everyone. 

Appendix: Course Descriptions from Oslo

Network and System Administration I


The aim of this course is to provide an under- 
standing of the role and procedures carried out by net- 
work and system administrators. It outlines general 
principles, while providing concrete hands-on exam- 
ples using [6]. 

Computer and Information Security 


An introduction to the theory and concepts of se- 
curity, as applied to computer systems. The course 
builds on the earlier course on system administration, 
where security was discussed from a practical view- 
point. A deeper understanding of the principles and 
examples of security is explored, going beyond the en- 
cryption style of many courses on security. Common 
problems in software design are also noted. 


Networking: Technologies and Principles 


Explaining the principles and practice of net- 
working technologies and protocols that are used to- 
day, for transferring and routing information between 
hosts. Analysis is a key part of understanding net- 
working, and students are expected to develop a suffi- 
cient understanding of the physics and mathematics of 
communication and traffic flow to gain full marks. 


The course introduces a theoretical foundation 
for the coming laboratory course, and the exercises 
teach some practical and diagnostic skills in using 
these protocols, including tools such as traceroute, and 
the router operating systems IOS/JunOS. 


Intrusion Detection and Firewall Security


This course introduces the fundamentals of TCP/IP
network monitoring at the packet level and its
application to computer security. Detecting and
protecting against hostile network activity has become
one of the important topics of Network and System
Administration.


Social and Ethical Aspects of Systems and Research


Originally two separate courses, this module now
combines both under one umbrella. In both components
students are expected to read and write about system
administration. In the research component, they read
the academic literature published in the field and
summarize it. This is an important skill for the master
thesis and for comprehension in a professional setting.
In the Social and Ethical component, students are
expected to formulate a standpoint of their own and
argue for it. This develops communication skills,
comprehension skills and powers of rational judgment.


Network Infrastructure and Security Lab 


The aim of this course is to give students sub- 
stantial experience in using networking and system ad- 
ministration equipment in as realistic an environment 
as we can create. Students should become proficient at 
handling hardware and software, as well as learn de- 
bugging skills and scientific methodology. Experi- 
ments carried out in the lab must be documented as 
scientific experiments and written up in clear, concise 
English. Presentation skills contribute to the grade, as 
well as documentation of analytical ability and sys- 
tematic work practice. Students are graded on perfor- 
mance, reporting, tidiness and safety. 


Network and System Administration II 


This is a course in essential services, software tools
and application services such as DNS, NFS, SMB, etc.
It includes understanding the integration of Microsoft
and Macintosh OS and network models, and evaluating and
gaining experience of data backup and archiving models.
Directory services, database administration and the
automation of LAN maintenance with Cfengine are
covered. Economic aspects of system management in
planning and resource management are discussed.


Analytical Methods for Systems 


The aim of this course is to provide a deeper per- 
spective on systems and their administration from a 
theoretical and cultural viewpoint. Only such a back- 
ground can stimulate research and creative solutions in 
the future. The course places system administration in 
the contexts of other subjects. Students must master 
some basic modeling techniques and be able to apply 
simple mathematical methods to problems of planning 
and analysis of human-computer systems. The results 
are applied to business process modeling also. 


High Volume Computing Services 


The aim of this course is to cover the fundamen- 
tals of implementing scalable computing systems for 
parallel computation and service delivery. This includes 
an understanding of the science of data centers where 
large installations are kept, and understanding the 
meaning of scalability. Students should know the
difference between High Performance Computing, High
Availability Computing and High Volume Computing 
and know how Amdahl’s law applies to parallelization 
“speedup” and load balancing. Some basic queueing 
theory is learned. 
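
As a rough illustration of the Amdahl’s law material named in this course description, the following Python sketch computes the speedup the law predicts. The function name `amdahl_speedup` and the 95% parallel fraction are assumptions chosen for illustration; they are not taken from the course.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law when `parallel_fraction` of the
    run time can be spread evenly over `workers` processors."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the run time parallelizes.
    for n in (1, 8, 64, 1024):
        print(f"{n:5d} workers -> speedup {amdahl_speedup(0.95, n):.2f}")
```

However many workers are added, the speedup in this example stays below 20, because the 5% serial part sets the limit.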


Supercomputers and Virtual Operating Systems 


This course is designed to fill a specific need for 
the computing industry: experience and understanding 
of supercomputers and virtual operating systems.
Specific attention is given to the IBM operating systems z/OS
(OS/390) and VM that are widely deployed in finance 
and banking sectors but which are rarely covered in 
university curricula. 


Final Thesis 


A one semester project, planned and executed by 
the student. 


Course Descriptions from Amsterdam 


Essential Skills for Administrators (ESA) 


This course forms the foundation for much of the 
daily work of a system administrator. If the use of 
open standards and open source software is to be ad- 
vocated credibly, system administrators must adhere to 
these standards as well. In the area of documentation, 
attention is paid to (pdf)(La)TeX, and XHTML is
considered for Web purposes. Version control tools,
including RCS, CVS and SVN, are also addressed, as is the use
of secure remote log-in (SSH) and secure communica- 
tion (PGP, GPG). Finally, a number of scripting lan- 
guages (shell, Perl, Python, Tcl/Tk and Ruby) are dis- 
cussed. Written reporting is an important component 
of the course. 


Classical Internet Applications (CIA) 


The aim of the course is to understand basic ar- 
chitectural issues in classical client-server environ- 
ments. Topics covered are: 

• Historical awareness of the development of
  Internet and UNIX.
• Insight into and knowledge of the most important
  classical client-server applications (DNS,
  Email, Web and Directory Services).
• Understanding the role of security in designing
  systems that must carry out the identified
  services.


Security of Systems and Networks (SSN) 


Systems are secured according to a variety of prin- 
ciples, including plain-text passwords, one-time pass- 
words, encrypted passwords, public/private keys and 
certificates. Networks are secured with firewalls and en- 
cryption on the network layer. The topics that are ad- 
dressed in this course include remote access using SSH, 
secure Web transactions using SSL/TLS, single sign-on 
using Kerberos, secure email using PGP/GPG, IPsec 
and key management. The course also considers the 
problems of wireless access and WEP. Many of the 
security systems that are mentioned for these purposes
are based on the encoding of information (encryption).
The course also addresses the (mathematical) principles
of cryptography. The following skills receive special
emphasis:

• Evaluation of security technology.
• Cooperation in groups of two and four.
• Independent literature searches.
• Written reporting.


Distributed Internet Applications (DIA)


The issues concerning the development of middle- 
ware systems for large-scale computer networks are 
discussed. Principles taught include communication, 
processes, naming, consistency and replication, fault 
tolerance, and security. These principles are further ex- 
plained by means of different paradigms applied to dis- 
tributed systems: object-based systems, distributed file 
systems (NFS), document-based systems (the Web), 
and coordination-based systems (publish/subscribe sys- 
tems). Explicit attention is paid to the practical feasibil- 
ity and scalability of various solutions. For this reason, 
experimental (research) systems as well as commercial- 
ly available systems are discussed. 


InterNetworking and Routing (INR) 


This course is all about the world of ISPs and 
transport providers in the Internet. Topics discussed 
are: 

• Mathematical modeling of addressing and routing
  in the Internet (both IPv4 and IPv6).
• Insight into and knowledge of the abstract and
  concrete algorithms that are used in routing
  systems.
• Insight into and knowledge of the virtualisation
  techniques that can be used to study networks
  and their routing systems (focusing on RIP,
  OSPF and BGP).


Large Installation Administration (LIA) 


The daily tasks of system administrators and the 
concepts with which they should be familiar are the 
focus of this course, which addresses the design, im- 
plementation and documentation of procedures for 
daily administration. Security, stability and manage- 
ability are primary requirements in this regard. The 
course addresses account management, storage man- 
agement and version management. Particular attention 
will be paid to the administration of complex systems 
and networks in large organisations. The course also 
addresses ITIL, PRINCE2 and other technologies. Fi- 
nally, presentation skills are emphasised in this course 
as well. 


Intrusion Detection Systems (IDS) 


This course focuses on methods and techniques 
to detect anomalous behaviour, to report on it and to 
take appropriate measures. Topics included are: 

• Examination of hacker techniques and the study
  of security systems from a hacker’s perspective.
• Detection techniques, including Intrusion
  Detection Systems and Honeynets.
• Protocol analysis and knowledge of tools.
• Rootkit and malware technology.
• Cooperation in groups of three or four and
  submission of a proof of concept.
• Independent literature searches.
• Ethical aspects of security and network
  technology.
• Presentation of research results and written
  reporting.


ICT and Company Practice (ICP)


Master-level positions within organisations re- 
quire the capacity to form and defend well-founded 
opinions concerning business-related ICT issues. Those 
who hold such positions serve as discussion partners for 
a wide range of managers, advisors and policy makers, 
many of whom are not technically oriented. 


A well-founded opinion is formed through the 
acquisition of knowledge. The ability to defend an 
opinion requires good communication (by means of 
presentations and written reports). 


The objectives of the ICP course can be formu- 
lated as follows: 

• The acquisition of knowledge concerning
  business-oriented ICT issues. Relevant topics are
  sourcing, legacy and information security. These
  topics are chosen specifically because of their
  interface with the work and responsibilities of
  system and network administrators.
• Critical evaluation of a non-technical scientific
  article in the area of ICT.
• Writing a non-technical scientific article in the
  area of ICT.


Research Projects 1 and 2 (RP1, RP2) 


The course objective is to ensure that students 
become acquainted with problems from the field of 
practice through two short projects, which require the 
development of non-trivial methods, concepts and so- 
lutions. Each course is a very intensive one-month 
full-time activity. After these courses, students should 
be able to: 

• Transform a roughly outlined problem into a
  carefully defined question, supported by
  literature on the topic.
• Establish a feasible project schedule for
  answering the question.
• Conduct autonomous research to answer the
  question at hand, using literature searches,
  studying, experimentation and/or the development
  of software and hardware.
• Present solutions to a diverse audience (experts
  as well as non-experts).
• Defend solutions in debates.
• Provide an appropriate report that is useful to a
  client.

On Designing and Deploying
Internet-Scale Services 


James Hamilton — Windows Live Services Platform 


ABSTRACT 


The system-to-administrator ratio is commonly used as a rough metric to understand adminis- 
trative costs in high-scale services. With smaller, less automated services this ratio can be as low as 
2:1, whereas on industry leading, highly automated services, we’ve seen ratios as high as 2,500:1. 
Within Microsoft services, Autopilot [1] is often cited as the magic behind the success of the Win- 
dows Live Search team in achieving high system-to-administrator ratios. While auto-administration 
is important, the most important factor is actually the service itself. Is the service efficient to auto- 
mate? Is it what we refer to more generally as operations-friendly? Services that are operations- 
friendly require little human intervention, and both detect and recover from all but the most obscure 
failures without administrative intervention. This paper summarizes the best practices accumulated 
over many years in scaling some of the largest services at MSN and Windows Live.


Introduction 


This paper summarizes a set of best practices for 
designing and developing operations-friendly services. 
This is a rapidly evolving subject area and, consequent- 
ly, any list of best practices will likely grow and morph 
over time. Our aim is to help others 

1. deliver operations-friendly services quickly and 

2. avoid the early morning phone calls and meet- 
ings with unhappy customers that non-opera- 
tions-friendly services tend to yield. 


The work draws on our experiences over the last 
20 years in high-scale data-centric software systems 
and internet-scale services, most recently from leading 
the Exchange Hosted Services team (at the time, a mid- 
sized service of roughly 700 servers and just over 2.2M 
users). We also incorporate the experiences of the Win- 
dows Live Search, Windows Live Mail, Exchange 
Hosted Services, Live Communications Server, Win- 
dows Live Address Book Clearing House (ABCH), 
MSN Spaces, Xbox Live, Rackable Systems Engineer- 
ing Team, and the Messenger Operations teams in ad- 
dition to that of the overall Microsoft Global Founda- 
tion Services Operations team. Several of these con- 
tributing services have grown to more than a quarter 
billion users. The paper also draws heavily on the work 
done at Berkeley on Recovery Oriented Computing [2, 
3] and at Stanford on Crash-Only Software [4, 5]. 


Bill Hoffman [6] contributed many best practices 
to this paper, but also a set of three simple tenets 
worth considering up front: 

1. Expect failures. A component may crash or be 
stopped at any time. Dependent components 
might fail or be stopped at any time. There will 
be network failures. Disks will run out of space. 
Handle all failures gracefully. 

2. Keep things simple. Complexity breeds prob- 
lems. Simple things are easier to get right. 
Avoid unnecessary dependencies. Installation
should be simple. Failures on one server should
have no impact on the rest of the data center.

3. Automate everything. People make mistakes. 
People need sleep. People forget things. Auto- 
mated processes are testable, fixable, and there- 
fore ultimately much more reliable. Automate 
wherever possible. 


These three tenets form a common thread through- 
out much of the discussion that follows. 


Recommendations 


This section is organized into ten sub-sections, 
each covering a different aspect of what is required to 
design and deploy an operations-friendly service. These 
sub-sections include overall service design; designing 
for automation and provisioning; dependency manage- 
ment; release cycle and testing; hardware selection and 
standardization; operations and capacity planning; au- 
diting, monitoring and alerting; graceful degradation 
and admission control; customer and press communica- 
tions plan; and customer self provisioning and self help. 


Overall Application Design 


We have long believed that 80% of operations is- 
sues originate in design and development, so this sec- 
tion on overall service design is the largest and most 
important. When systems fail, there is a natural ten- 
dency to look first to operations since that is where the 
problem actually took place. Most operations issues, 
however, either have their genesis in design and devel- 
opment or are best solved there. 


Throughout the sections that follow, a consensus 
emerges that firm separation of development, test, and 
operations isn’t the most effective approach in the ser- 
vices world. The trend we’ve seen when looking 
across many services is that low-cost administration 
correlates highly with how closely the development, 
test, and operations teams work together. 

In addition to the best practices on service design
discussed here, the subsequent section, “Designing for 
Automation Management and Provisioning,” also has 
substantial influence on service design. Effective auto- 
matic management and provisioning are generally 
achieved only with a constrained service model. This 
is a repeating theme throughout: simplicity is the key 
to efficient operations. Rational constraints on hard- 
ware selection, service design, and deployment mod- 
els are a big driver of reduced administrative costs and 
greater service reliability. 


Some of the operations-friendly basics that have 
the biggest impact on overall service design are: 

• Design for failure. This is a core concept when
developing large services that comprise many
cooperating components. Those components will
fail, and they will fail frequently. Nor do the
components always cooperate or fail independently.
Once the service has scaled beyond
10,000 servers and 50,000 disks, failures will oc- 
cur multiple times a day. If a hardware failure re- 
quires any immediate administrative action, the 
service simply won’t scale cost-effectively and 
reliably. The entire service must be capable of 
surviving failure without human administrative 
interaction. Failure recovery must be a very sim- 
ple path and that path must be tested frequently. 
Armando Fox of Stanford [4, 5] has argued that 
the best way to test the failure path is never to 
shut the service down normally. Just hard-fail it. 
This sounds counter-intuitive, but if the failure 
paths aren’t frequently used, they won’t work 
when needed [7]. 
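
A back-of-the-envelope calculation shows why failures become a daily event at this scale. The sketch below takes the fleet size from the text; the annualized failure rates are assumed values picked purely for illustration (the paper gives none), and failures are treated as independent.

```python
def expected_failures_per_day(units: int, annual_failure_rate: float) -> float:
    """Mean failures per day, assuming each unit fails independently
    with the given annualized failure rate (AFR)."""
    return units * annual_failure_rate / 365.0


# Fleet size from the text; the failure rates below are assumptions.
SERVERS, DISKS = 10_000, 50_000
ASSUMED_SERVER_AFR = 0.02  # hypothetical: 2% of servers fail per year
ASSUMED_DISK_AFR = 0.03    # hypothetical: 3% of disks fail per year

server_failures = expected_failures_per_day(SERVERS, ASSUMED_SERVER_AFR)
disk_failures = expected_failures_per_day(DISKS, ASSUMED_DISK_AFR)
total = server_failures + disk_failures

print(f"servers: {server_failures:.2f} failures/day")
print(f"disks:   {disk_failures:.2f} failures/day")
print(f"total:   {total:.2f} failures/day")
```

With these assumed rates the fleet averages between four and five component failures per day, most of them disks, which is why any failure that needs immediate human attention does not scale.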

• Redundancy and fault recovery. The mainframe
model was to buy one very large, very expensive 
server. Mainframes have redundant power sup- 
plies, hot-swappable CPUs, and exotic bus ar- 
chitectures that provide respectable I/O through- 
put in a single, tightly-coupled system. The ob- 
vious problem with these systems is their ex- 
pense. And even with all the costly engineering, 
they still aren’t sufficiently reliable. In order to 
get the fifth 9 of reliability, redundancy is re- 
quired. Even getting four 9’s on a single-system 
deployment is difficult. This concept is fairly 
well understood industry-wide, yet it’s still com- 
mon to see services built upon fragile, non-re- 
dundant data tiers. 

Designing a service such that any system can 
crash (or be brought down for service) at any 
time while still meeting the service level agree- 
ment (SLA) requires careful engineering. The 
acid test for full compliance with this design 
principle is the following: is the operations 
team willing and able to bring down any server 
in the service at any time without draining the 
work load first? If they are, then there is syn- 
chronous redundancy (no data loss), failure 
detection, and automatic take-over. As a design
approach, we recommend a technique commonly
used to find and correct potential service
security issues: security threat modeling. In
security threat modeling [8], we consider each
possible security threat and, for each, imple- 
ment adequate mitigation. The same approach 
can be applied to designing for fault resiliency 
and recovery. 
Document all conceivable component failure
modes and combinations thereof. For each fail- 
ure, ensure that the service can continue to op- 
erate without unacceptable loss in service quali- 
ty, or determine that this failure risk is accept- 
able for this particular service (e.g., loss of an 
entire data center in a non-geo-redundant ser- 
vice). Very unusual combinations of failures 
may be determined sufficiently unlikely that 
ensuring the system can operate through them 
is uneconomical. Be cautious when making this 
judgment. We’ve been surprised at how frequently
“unusual” combinations of events take
place when running thousands of servers that 
produce millions of opportunities for compo- 
nent failures each day. Rare combinations can 
become commonplace. 

• Commodity hardware slice. All components of
the service should target a commodity hardware
slice. For example, storage-light servers will be
dual socket, 2- to 4-core systems in the $1,000 
to $2,500 range with a boot disk. Storage-heavy 
servers are similar servers with 16 to 24 disks. 

The key observations are: 

1. large clusters of commodity servers are 
much less expensive than the small num- 
ber of large servers they replace, 

2. server performance continues to increase 
much faster than I/O performance, making 
a small server a more balanced system for 
a given amount of disk, 

3. power consumption scales linearly with 
servers but cubically with clock frequency, 
making higher performance servers more 
expensive to operate, and 

4. a small server affects a smaller proportion 
of the overall service workload when fail- 
ing over. 

• Single-version software. Two factors that make
some services less expensive to develop and
faster to evolve than most packaged products
are
o the software needs to target only a single
internal deployment and
o previous versions don’t have to be supported
for a decade as is the case for
enterprise-targeted products.
Single-version software is relatively easy to
achieve with a consumer service, especially one
provided without charge. But it’s equally important
when selling subscription-based services to
non-consumers. Enterprises are used to having 
significant influence over their software providers 
and to having complete control over when they 
deploy new versions (typically slowly). This 
drives up the cost of their operations and the 
cost of supporting them since so many versions 
of the software need to be supported. 
The most economical services don’t give
customers control over the version they run, and
only host one version. Holding this single-ver- 
sion software line requires 
1. care in not producing substantial user ex- 
perience changes release-to-release and 
2. a willingness to allow customers that need 
this level of control to either host internally 
or switch to an application service provider 
willing to provide this people-intensive 
multi-version support. 
• Multi-tenancy. Multi-tenancy is the hosting of
all companies or end users of a service in the
same service without physical isolation, whereas
single tenancy is the segregation of groups of
users in an isolated cluster. The argument for
multi-tenancy is nearly identical to the argument
for single-version support and is based upon
providing a fundamentally lower cost of service
built upon automation and large scale.
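
The "design for failure" tenet above notes that at 10,000
servers and 50,000 disks, failures occur multiple times a
day. The following is a minimal sketch of that arithmetic.
The annual failure rates are illustrative assumptions, not
figures from this paper, and the function name is
hypothetical.

```python
def expected_failures_per_day(count, annual_failure_rate):
    """Expected number of failures per day across `count` components,
    each with the given (assumed) annual failure rate."""
    return count * annual_failure_rate / 365.0


# Hypothetical rates chosen only for illustration.
SERVER_AFR = 0.05  # assumption: 5% of servers fail per year
DISK_AFR = 0.03    # assumption: 3% of disks fail per year

servers_per_day = expected_failures_per_day(10_000, SERVER_AFR)
disks_per_day = expected_failures_per_day(50_000, DISK_AFR)
total_per_day = servers_per_day + disks_per_day

print(f"servers: {servers_per_day:.2f}/day, "
      f"disks: {disks_per_day:.2f}/day, "
      f"total: {total_per_day:.2f}/day")
```

Even with modest assumed rates, the total is several
failures per day, which is why any failure that needs
immediate human action does not scale.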


In review, the basic design tenets and considerations
we have laid out above are:

• design for failure,
• implement redundancy and fault recovery,
• depend upon a commodity hardware slice,
• support single-version software, and
• enable multi-tenancy.
We are constraining the service design and operations 
model to maximize our ability to automate and to re- 
duce the overall costs of the service. We draw a clear 
distinction between these goals and those of application
service providers or IT outsourcers. Those businesses
tend to be more people-intensive and more willing
to run complex, customer-specific configurations.


More specific best practices for designing opera- 
tions-friendly services are: 

• Quick service health check. This is the service
equivalent of a build verification test. It’s a sniff
test that can be run quickly on a developer’s 
system to ensure that the service isn’t broken in 
any substantive way. Not all edge cases are test- 
ed, but if the quick health check passes, the 
code can be checked in. 

• Develop in the full environment. Developers
should be unit testing their components, but 
should also be testing the full service with their 
component changes. Achieving this goal effi- 
ciently requires single-server deployment (sec- 
tion 2.4), and the preceding best practice, a 
quick service health check. 
• Zero trust of underlying components. Assume
that underlying components will fail and ensure
that components will be able to recover and continue
to provide service. The recovery technique
is service-specific, but common techniques are to
o continue to operate on cached data in read-only
mode or
o continue to provide service to all but a tiny
fraction of the user base during the short
time while the service is accessing the redundant
copy of the failed component.
• Do not build the same functionality in multiple
components. Foreseeing future interactions is 
hard, and fixes have to be made in multiple 
parts of the system if code redundancy creeps 
in. Services grow and evolve quickly. Without 
care, the code base can deteriorate rapidly. 
• One pod or cluster should not affect another pod
or cluster. Most services are formed of pods or
sub-clusters of systems that work together to provide
the service, where each pod is able to operate
relatively independently. Each pod should be
as close to 100% independent as possible, with no
inter-pod correlated failures. Global services, even
with redundancy, are a central point of failure.
Sometimes they cannot be avoided, but try to have
everything that a cluster needs inside the cluster.
• Allow (rare) emergency human intervention. The
common scenario for this is the movement of us- 
er data due to a catastrophic event or other emer- 
gency. Design the system to never need human 
interaction, but understand that rare events will 
occur where combined failures or unanticipated 
failures require human interaction. These events 
will happen and operator error under these cir- 
cumstances is a common source of catastrophic 
data loss. An operations engineer working under 
pressure at 2 a.m. will make mistakes. Design 
the system to first not require operations inter- 
vention under most circumstances, but work 
with operations to come up with recovery plans 
if they need to intervene. Rather than docu- 
menting these as multi-step, error-prone proce- 
dures, write them as scripts and test them in 
production to ensure they work. What isn’t test- 
ed in production won’t work, so periodically 
the operations team should conduct a “fire 
drill” using these tools. If the service-availabil- 
ity risk of a drill is excessively high, then insuf- 
ficient investment has been made in the design, 
development, and testing of the tools. 
• Keep things simple and robust. Complicated
algorithms and component interactions multiply
the difficulty of debugging, deploying, etc.
Simple and nearly stupid is almost always better
in a high-scale service: the number of interacting
failure modes is already daunting before
complex optimizations are delivered. Our general
rule is that optimizations that bring an
order of magnitude improvement are worth
considering, but percentage or even small factor
gains aren’t worth it.

• Enforce admission control at all levels. Any
good system is designed with admission control
at the front door. This follows the long-understood
principle that it’s better to not let more
work into an overloaded system than to continue
accepting work and begin to thrash.
Some form of throttling or admission control is
common at the entry to the service, but there
should also be admission control at all major
component boundaries. Work load characteristic
changes will eventually lead to sub-component
overload even though the overall service is
operating within acceptable load levels. See the
note below in section 2.8 on the “big red
switch” as one way of gracefully degrading under
excess load. The general rule is to attempt
to gracefully degrade rather than hard failing
and to block entry to the service before giving
uniformly poor service to all users.

• Partition the service. Partitions should be
infinitely adjustable and fine-grained, and not be
bounded by any real world entity (person, col- 
lection ...). If the partition is by company, then 
a big company will exceed the size of a single 
partition. If the partition is by name prefix, then 
eventually all the P’s, for example, won’t fit on 
a single server. We recommend using a look-up 
table at the mid-tier that maps fine-grained enti- 
ties, typically users, to the system where their 
data is managed. Those fine-grained partitions 
can then be moved freely between servers. 
• Understand the network design. Test early to
understand what load is driven between servers 
in a rack, across racks, and across data centers. 
Application developers must understand the 
network design and it must be reviewed early 
with networking specialists on the operations 
team. 

• Analyze throughput and latency. Analysis of the
throughput and latency of core service user in- 
teractions should be performed to understand 
impact. Do so with other operations running 
such as regular database maintenance, opera- 
tions configuration (new users added, users mi- 
grated), service debugging, etc. This will help 
catch issues driven by periodic management 
tasks. For each service, a metric should emerge 
for capacity planning such as user requests per 
second per system, concurrent on-line users per 
system, or some related metric that maps rele- 
vant work load to resource requirements. 

• Treat operations utilities as part of the service.
Operations utilities produced by development,
test, program management, and operations should
be code-reviewed by development, checked into
the main source tree, and tracked on the same
schedule and with the same testing. Frequently
these utilities are mission critical and yet nearly
untested.

• Understand access patterns. When planning
new features, always consider what load they 
are going to put on the backend store. Often the 
service model and service developers become 
so abstracted away from the store that they lose 
sight of the load they are putting on the under- 
lying database. A best practice is to build it into 
the specification with a section such as, “What 
impacts will this feature have on the rest of the 
infrastructure?” Then measure and validate the 
feature for load when it goes live. 

• Version everything. Expect to run in a mixed-version
environment. The goal is to run single-version
software but multiple versions will be
live during rollout and production testing. Versions
n and n+1 of all components need to coexist
peacefully.

• Keep the unit/functional tests from the previous
release. These tests are a great way of verifying 
that version n-1 functionality doesn’t get bro- 
ken. We recommend going one step further and 
constantly running service verification tests in 
production (more detail below). 

• Avoid single points of failure. Single points of
failure will bring down the service or portions 
of the service when they fail. Prefer stateless 
implementations. Don’t affinitize requests or 
clients to specific servers. Instead, load balance 
over a group of servers able to handle the load. 
Static hashing or any static work allocation to 
servers will suffer from data and/or query skew 
problems over time. Scaling out is easy when 
machines in a class are interchangeable. Data- 
bases are often single points of failure and data- 
base scaling remains one of the hardest prob- 
lems in designing internet-scale services. Good 
designs use fine-grained partitioning and don’t 
support cross-partition operations to allow efficient
scaling across many database servers. All
database state is stored redundantly on at least
one fully redundant hot standby server, and
failover is tested frequently in production.
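
The look-up table recommended under "Partition the service"
above can be sketched as follows. This is a minimal
illustration under stated assumptions, not the
implementation used by any particular service: the class
name, method names, and the least-loaded placement policy
are all hypothetical.

```python
class PartitionMap:
    """Maps fine-grained entities (here, user IDs) to the server
    that holds their data. Hypothetical sketch."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._location = {}  # user_id -> server name

    def load(self, server):
        """Number of users currently mapped to `server`."""
        return sum(1 for s in self._location.values() if s == server)

    def locate(self, user_id):
        """Return the server for `user_id`, placing new users on the
        least-loaded server (an assumed policy)."""
        if user_id not in self._location:
            self._location[user_id] = min(self._servers, key=self.load)
        return self._location[user_id]

    def add_server(self, server):
        self._servers.append(server)

    def move(self, user_id, new_server):
        """Re-home a single user. A real system would copy the data
        before updating the table entry."""
        if new_server not in self._servers:
            raise KeyError(new_server)
        self._location[user_id] = new_server


pmap = PartitionMap(["db1", "db2"])
for uid in ["alice", "bob", "carol", "dave"]:
    pmap.locate(uid)

# Growth: add capacity and move one fine-grained partition to it.
pmap.add_server("db3")
pmap.move("alice", "db3")
```

Because the unit of placement is a single user rather than
a company or a name prefix, no partition can outgrow a
server, and rebalancing is a table update plus a data copy.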


Automatic Management and Provisioning 


Many services are written to alert operations on 
failure and to depend upon human intervention for re- 
covery. The problem with this model starts with the 
expense of a 24x7 operations staff. Even more impor- 
tant is that if operations engineers are asked to make 
tough decisions under pressure, about 20% of the time 
they will make mistakes. The model is both expensive 
and error-prone, and reduces overall service reliability. 


Designing for automation, however, involves sig- 
nificant service-model constraints. For example, some 
of the large services today depend upon database sys- 
tems with asynchronous replication to a secondary, 
back-up server. Failing over to the secondary after the
primary isn’t able to service requests loses some cus- 
tomer data due to replicating asynchronously. However, 
not failing over to the secondary leads to service down- 
time for those users whose data is stored on the failed 
database server. Automating the decision to fail over is 
hard in this case since it’s dependent upon human judgment
and accurately estimating the amount of data loss
compared to the likely length of the down time. A sys- 
tem designed for automation pays the latency and 
throughput cost of synchronous replication. And, hav- 
ing done that, failover becomes a simple decision: if the 
primary is down, route requests to the secondary. This 
approach is much more amenable to automation and is 
considerably less error prone. 
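
The contrast between the two replication models can be
sketched as a routing decision. This is a minimal sketch;
the function name and return values are hypothetical, and
health detection is assumed to happen elsewhere.

```python
def route(primary_up, secondary_up, replication="sync"):
    """Decide where to send requests for one database partition.

    With synchronous replication the secondary holds every committed
    write, so failover is mechanical. With asynchronous replication,
    failing over may lose recent writes, so the sketch escalates
    instead of deciding automatically.
    """
    if primary_up:
        return "primary"
    if not secondary_up:
        return "unavailable"
    if replication == "sync":
        return "secondary"
    return "needs-human-decision"


print(route(primary_up=False, secondary_up=True, replication="sync"))
print(route(primary_up=False, secondary_up=True, replication="async"))
```

The synchronous branch contains no judgment call, which is
what makes it safe to automate.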


Automating administration of a service after de- 
sign and deployment can be very difficult. Successful 
automation requires simplicity and clear, easy-to-make 
operational decisions. This in turn depends on a care- 
ful service design that, when necessary, sacrifices 
some latency and throughput to ease automation. The 
trade-off is often difficult to make, but the administra- 
tive savings can be more than an order of magnitude 
in high-scale services. In fact, the current spread be- 
tween the most manual and the most automated ser- 
vice we’ve looked at is a full two orders of magnitude 
in people costs. 


Best practices in designing for automation include: 
• Be restartable and redundant. All operations must
be restartable and all persistent state must be
stored redundantly.

• Support geo-distribution. All high-scale services
should support running across several hosting 
data centers. In fairness, automation and most 
of the efficiencies we describe here are still 
possible without geo-distribution. But lacking 
support for multiple data center deployments 
drives up operations costs dramatically. With- 
out geo-distribution, it’s difficult to use free ca- 
pacity in one data center to relieve load on a 
service hosted in another data center. Lack of 
geo-distribution is an operational constraint that 
drives up costs. 
• Automatic provisioning and installation. Provisioning
and installation, if done by hand, are
costly and failure-prone, and small
configuration differences will slowly spread
throughout the service, making problem determination
much more difficult.
• Configuration and code as a unit. Ensure that
o the development team delivers the code
and the configuration as a single unit,
o the unit is deployed by test in exactly the
same way that operations will deploy it,
and
o operations deploys them as a unit.
Services that treat configuration and code as a
unit and only change them together are often
more reliable.
If a configuration change must be made in production,
ensure that all changes produce an audit
log record so it’s clear what was changed,
when and by whom, and which servers were affected
(see section 2.7). Frequently scan all
servers to ensure their current state matches the 
intended state. This helps catch install and con- 
figuration failures, detects server misconfigura- 
tions early, and finds non-audited server config- 
uration changes. 

• Manage server roles or personalities rather than
servers. Every system role or personality should
support deployment on as many or as few servers
as needed.

• Multi-system failures are common. Expect failures
of many hosts at once (power, net switch,
and rollout). Unfortunately, services with state
will have to be topology-aware. Correlated failures
remain a fact of life.

• Recover at the service level. Handle failures and
correct errors at the service level where the full 
execution context is available rather than in 
lower software levels. For example, build re- 
dundancy into the service rather than depending 
upon recovery at the lower software layer. 

• Never rely on local storage for non-recoverable
information. Always replicate all the non-ephemeral
service state.

• Keep deployment simple. File copy is ideal as it
gives the most deployment flexibility. Mini- 
mize external dependencies. Avoid complex in- 
stall scripts. Anything that prevents different 
components or different versions of the same 
component from running on the same server 
should be avoided. 

• Fail services regularly. Take down data centers,
shut down racks, and power off servers. Regu- 
lar controlled brown-outs will go a long way to 
exposing service, system, and network weak- 
nesses. Those unwilling to test in production 
aren’t yet confident that the service will contin- 
ue operating through failures. And, without 
production testing, recovery won’t work when 
called upon. 
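
The practice of frequently scanning all servers to ensure
their current state matches the intended state, described
under "Configuration and code as a unit" above, can be
sketched as a simple comparison. The data layout, server
names, and function name are hypothetical, and collecting
the actual state from each server is assumed to happen
elsewhere.

```python
def find_drift(intended, actual):
    """Compare intended and observed configuration per server.

    Returns {server: {key: (intended_value, actual_value)}} for every
    setting that differs, including settings present on only one side.
    """
    drift = {}
    for server, want in intended.items():
        have = actual.get(server, {})
        diffs = {
            key: (want.get(key), have.get(key))
            for key in want.keys() | have.keys()
            if want.get(key) != have.get(key)
        }
        if diffs:
            drift[server] = diffs
    return drift


# Hypothetical inventory for illustration.
intended = {
    "web01": {"version": "1.4", "role": "frontend"},
    "web02": {"version": "1.4", "role": "frontend"},
}
actual = {
    "web01": {"version": "1.4", "role": "frontend"},
    "web02": {"version": "1.3", "role": "frontend", "debug": "on"},
}

drift = find_drift(intended, actual)
for server, diffs in sorted(drift.items()):
    print(server, diffs)
```

Here web02 is reported both for a failed install (old
version) and for a non-audited change (an extra setting),
the two cases the scan is meant to catch.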


Dependency Management 


Dependency management in high-scale services 
often doesn’t get the attention the topic deserves. As a 
general rule, dependence on small components or ser- 
vices doesn’t save enough to justify the complexity of 
managing them. Dependencies do make sense when: 

1. the components being depended upon are sub- 
stantial in size or complexity, or 
2. the service being depended upon gains its value 
in being a single, central instance. 
Examples of the first class are storage and consensus 
algorithm implementations. Examples of the second 
class are identity and group management systems.
The whole value of these systems is that they are a
single, shared instance so multi-instancing to avoid
dependency isn’t an option.


Assuming that dependencies are justified accord- 
ing to the above rules, some best practices for manag- 
ing them are: 

• Expect latency. Calls to external components
may take a long time to complete. Don’t let de- 
lays in one component or service cause delays 
in completely unrelated areas. Ensure all inter- 
actions have appropriate timeouts to avoid ty- 
ing up resources for protracted periods. Opera- 
tional idempotency allows the restart of re- 
quests after timeout even though those requests 
may have partially or even fully completed. En- 
sure all restarts are reported and bound restarts 
to avoid a repeatedly failing request from con- 
suming ever more system resources. 

• Isolate failures. The architecture of the site must
prevent cascading failures. Always “fail fast.”
When dependent services fail, mark them as 
down and stop using them to prevent threads 
from being tied up waiting on failed compo- 
nents. 

• Use shipping and proven components. Proven
technology is almost always better than operat- 
ing on the bleeding edge. Stable software is 
better than an early copy, no matter how valu- 
able the new feature seems. This rule applies to 
hardware as well. Stable hardware shipping in 
volume is almost always better than the small 
performance gains that might be attained from 
early release hardware. 

• Implement inter-service monitoring and alerting.
If the service is overloading a dependent service,
the depending service needs to know and,
if it can’t back off automatically, alerts need to
be sent. If operations can’t resolve the problem 
quickly, it needs to be easy to contact engineers 
from both teams quickly. All teams with depen- 
dencies should have engineering contacts on 
the dependent teams. 

• Dependent services require the same design
point. Dependent services and producers of de- 
pendent components need to be committed to at 
least the same SLA as the depending service. 

• Decouple components. Where possible, ensure
that components can continue operation, per- 
haps in a degraded mode, during failures of 
other components. For example, rather than re- 
authenticating on each connect, maintain a ses- 
sion key and refresh it every N hours indepen- 
dent of connection status. On reconnect, just 
use the existing session key. That way the load on
the authenticating server is more consistent and 
login storms are not driven on reconnect after 
momentary network failure and related events. 
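
The "expect latency" and "isolate failures" practices above
(timeouts on every call, bounded and reported restarts, and
marking a failed dependency as down so callers fail fast)
can be combined in one small wrapper. This is a minimal
sketch under stated assumptions: the class, its parameters,
and the thresholds are hypothetical, and the wrapped call is
assumed to be idempotent and to raise TimeoutError when it
exceeds its timeout.

```python
import time


class DependencyDown(Exception):
    """Raised immediately while a dependency is marked down."""


class GuardedDependency:
    def __init__(self, call, max_retries=2, failure_threshold=3,
                 cooldown_s=30.0, clock=time.monotonic):
        self._call = call                  # callable(timeout_s), idempotent
        self._max_retries = max_retries    # bound on restarts per request
        self._failure_threshold = failure_threshold
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._consecutive_failures = 0
        self._down_until = 0.0
        self.retries_reported = 0          # stand-in for real reporting

    def request(self, timeout_s=1.0):
        # Fail fast: do not tie up a thread on a known-bad dependency.
        if self._clock() < self._down_until:
            raise DependencyDown("dependency marked down")
        for attempt in range(self._max_retries + 1):
            try:
                result = self._call(timeout_s)
                self._consecutive_failures = 0
                return result
            except TimeoutError:
                self._consecutive_failures += 1
                if self._consecutive_failures >= self._failure_threshold:
                    self._down_until = self._clock() + self._cooldown_s
                    raise DependencyDown("too many consecutive timeouts")
                if attempt < self._max_retries:
                    self.retries_reported += 1  # report every restart
        raise TimeoutError("request failed after bounded retries")
```

Restarting a timed-out request is only safe because the
call is assumed idempotent, as the text requires. The
cooldown gives the failed dependency time to recover before
it is tried again.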


Release Cycle and Testing 


Testing in production is a reality and needs to be 
part of the quality assurance approach used by all 
internet-scale services. Most services have at least one
test lab that is as similar to production as (affordably) 
possible and all good engineering teams use produc- 
tion workloads to drive the test systems realistically. 
Our experience has been, however, that as good as 
these test labs are, they are never full fidelity. They al- 
ways differ in at least subtle ways from production. As 
these labs approach the production system in fidelity, 
the cost goes asymptotic and rapidly approaches that 
of the production system. 


We instead recommend taking new service re- 
leases through standard unit, functional, and produc- 
tion test lab testing and then going into limited pro- 
duction as the final test phase. Clearly we don’t want 
software going into production that doesn’t work or 
puts data integrity at risk, so this has to be done care- 
fully. The following rules must be followed: 

1. the production system has to have sufficient re- 
dundancy that, in the event of catastrophic new 
service failure, state can be quickly recovered,

2. data corruption or state-related failures have to 
be extremely unlikely (functional testing must 
first be passing), 

3. errors must be detected and the engineering 
team (rather than operations) must be monitor- 
ing system health of the code in test, and 

4. it must be possible to quickly roll back all 
changes and this roll back must be tested before 
going into production. 

This sounds dangerous. But we have found that 
using this technique actually improves customer expe- 
rience around new service releases. Rather than de- 
ploying as quickly as possible, we put one system in 
production for a few days in a single data center. Then 
we bring one new system into production in each data 
center. Then we’ll move an entire data center into pro- 
duction on the new bits. And finally, if quality and 
performance goals are being met, we deploy globally. 
This approach can find problems before the service is 
at risk and can actually provide a better customer ex- 
perience through the version transition. Big-bang de- 
ployments are very dangerous. 
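
The staged deployment described above (one system, then one
system per data center, then a whole data center, then
global) can be sketched as a loop that gates each stage on
health and rolls back on failure. This is a minimal sketch:
the function and the deploy, healthy, and rollback callables
are hypothetical placeholders for service-specific tooling,
and the soak time between stages is omitted.

```python
def staged_rollout(stages, deploy, healthy, rollback):
    """Deploy stage by stage. On the first unhealthy stage, roll back
    everything deployed so far (most recent first) and return [].
    Otherwise return the names of the completed stages."""
    done = []
    for name, targets in stages:
        deploy(targets)
        done.append((name, targets))
        if not healthy(targets):
            for _, deployed in reversed(done):
                rollback(deployed)
            return []
    return [name for name, _ in done]


# Hypothetical target names for illustration.
stages = [
    ("one system", ["dc1-s1"]),
    ("one system per data center", ["dc2-s1", "dc3-s1"]),
    ("one data center", ["dc1"]),
    ("global", ["dc2", "dc3"]),
]

completed = staged_rollout(
    stages,
    deploy=lambda targets: print("deploy", targets),
    healthy=lambda targets: True,
    rollback=lambda targets: print("rollback", targets),
)
print(completed)
```

The rollback path is exercised by the same code that
deploys, which fits the rule that roll-back must be tested
before going into production.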


Another potentially counter-intuitive approach we 
favor is deployment mid-day rather than at night. At 
night, there is greater risk of mistakes. And, if anom- 
alies crop up when deploying in the middle of the 
night, there are fewer engineers around to deal with 
them. The goal is to minimize the number of engineer- 
ing and operations interactions with the system over- 
all, and especially outside of the normal work day, to 
both reduce costs and to increase quality. 


Some best practices for release cycle and testing 
include: 
• Ship often. Intuitively one would think that
shipping more frequently is harder and more er- 
ror prone. We’ve found, however, that more 
frequent releases have fewer big-bang changes.
Consequently, the releases tend to be higher 
quality and the customer experience is much 
better. The acid test of a good release is that the 
user experience may have changed but the 
number of operational issues around availabili- 
ty and latency should be unchanged during the 
release cycle. We like shipping on 3-month cy- 
cles, but arguments can be made for other 
schedules. Our gut feel is that the norm will 
eventually be less than three months, and many 
services are already shipping on weekly sched- 
ules. Cycles longer than three months are dan- 
gerous. 

• Use production data to find problems. Quality
assurance in a large-scale system is a data-min- 
ing and visualization problem, not a testing 
problem. Everyone needs to focus on getting 
the most out of the volumes of data in a produc- 
tion environment. A few strategies are: 

o Measurable release criteria. Define specific 
criteria around the intended user experi- 
ence, and continuously monitor it. If avail- 
ability is supposed to be 99%, measure that 
availability meets the goal. Both alert and 
diagnose if it goes under. 

o Tune goals in real time. Rather than getting
bogged down deciding whether the goal 
should be 99% or 99.9% or any other goal, 
set an acceptable target and then ratchet it 
up as the system establishes stability in 
production. 

o Always collect the actual numbers. Collect
the actual metrics rather than red and green 
or other summary reports. Summary re- 
ports and graphs are useful but the raw 
data is needed for diagnosis. 

o Minimize false positives. People stop paying
attention very quickly when the data is incorrect.
It’s important to not over-alert or operations
staff will learn to ignore alerts. This is
so important that hiding real problems as 
collateral damage is often acceptable. 
o Analyze trends. This can be used for predicting
problems. For example, when data
movement in the system diverges from the 
usual rate, it often predicts a bigger prob- 
lem. Exploit the available data. 

o Make the system health highly visible. Require
a globally available, real-time display
of service health for the entire organization.
Have an internal website people
can go to at any time to understand the current
state of the service.

o Monitor continuously. It bears noting that
people must be looking at all the data ev- 
ery day. Everyone should do this, but make 
it the explicit job of a subset of the team to 
do this. 
• Invest in engineering. Good engineering minimizes
operational requirements and solves problems
before they actually become operational issues.
Too often, organizations grow operations to
deal with scale and never take the time to engi- 
neer a scalable, reliable architecture. Services 
that don’t think big to start with will be scram- 
bling to catch up later. 

• Support version roll-back. Version roll-back is
mandatory and must be tested and proven before
roll-out. Without roll-back, any form of
production-level testing is very high risk. Reverting
to the previous version is a rip cord that
should always be available on any deployment. 
• Maintain forward and backward compatibility.
This vital point strongly relates to the previous
one. Changing file formats, interfaces, logging/debugging,
instrumentation, monitoring, and contact
points between components are all potential
risks. Don’t rip out support for old file formats
until there is no chance of a roll back to that old 
format in the future. 

• Single-server deployment. This is both a test
and development requirement. The entire ser- 
vice must be easy to host on a single system. 
Where single-server deployment is impossible 
for some component (e.g., a dependency on an 
external, non-single box deployable service), 
write an emulator to allow single-server testing. 
Without this, unit testing is difficult and doesn’t 
fully happen. And if running the full system is 
difficult, developers will have a tendency to take 
a component view rather than a systems view. 
• Stress test for load. Run some tiny subset of the
production systems at twice (or more) the load 
to ensure that system behavior at higher than 
expected load is understood and that the sys- 
tems don’t melt down as the load goes up. 
• Perform capacity and performance testing prior
to new releases. Do this at the service level and 
also against each component since work load 
characteristics will change over time. Problems 
and degradations inside the system need to be 
caught early. 

• Build and deploy shallowly and iteratively. Get a
skeleton version of the full service up early in 
the development cycle. This full service may 
hardly do anything at all and may include 
shunts in places but it allows testers and devel- 
opers to be productive and it gets the entire 
team thinking at the user level from the very 
beginning. This is a good practice when build- 
ing any software system, but is particularly im- 
portant for services. 

• Test with real data. Fork user requests or workload
from production to test environments.
Pick up production data and put it in test environments.
The diverse user population of the
product will always be most creative at finding
bugs. Clearly, privacy commitments must be
maintained so it’s vital that this data never
leak back out into production.

• Run system-level acceptance tests. Tests that
run locally provide a sanity check that speeds
iterative development. To avoid heavy maintenance
cost they should still be at system level.

• Test and develop in full environments. Set aside
hardware to test at interesting scale. Most im- 
portantly, use the same data collection and min- 
ing techniques used in production on these en- 
vironments to maximize the investment. 
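
The "measurable release criteria" and "tune goals in real
time" strategies above can be sketched as two small
functions: one measures availability against a goal and
flags when it goes under, the other ratchets the goal up
once production has been stable. The function names, the
step size, and the ratchet rule itself are illustrative
assumptions, not a policy stated in this paper.

```python
def check_goal(successes, total, goal):
    """Return (measured_availability, alert). `alert` is True when
    the measured availability is under the goal."""
    if total <= 0:
        raise ValueError("total must be positive")
    measured = successes / total
    return measured, measured < goal


def ratchet(goal, recent_measurements, step=0.001, ceiling=0.9999):
    """Raise the goal by `step` only if every recent measurement
    already meets the raised goal (assumed rule); otherwise keep it."""
    candidate = min(goal + step, ceiling)
    if recent_measurements and all(m >= candidate for m in recent_measurements):
        return candidate
    return goal


# Keep the raw numbers, not just a red/green summary.
measured, alert = check_goal(successes=9_850, total=10_000, goal=0.99)
print(f"availability={measured:.4f} alert={alert}")

new_goal = ratchet(0.99, recent_measurements=[0.995, 0.996, 0.997])
print(f"goal is now {new_goal:.3f}")
```

Starting from an acceptable target and tightening it only
after sustained stability avoids the up-front debate over
whether the goal should be 99% or 99.9%.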


Hardware Selection and Standardization 


The usual argument for SKU standardization is 
that bulk purchases can save considerable money. This 
is inarguably true. The larger need for hardware standardization
is that it allows for faster service deployment
and growth. If each service is purchasing its
own private infrastructure, then each service has to:

1. determine which hardware currently is the best 
cost/performing option, 
2. order the hardware, and 
3. do hardware qualification and software deploy- 
ment once the hardware is installed in the data 
center. 
This usually takes a month and can easily take more. 


A better approach is a “services fabric” that in- 
cludes a small number of hardware SKUs and the au- 
tomatic management and provisioning infrastructure 
on which all services are run. If more machines are
needed for a test cluster, they are requested via a web 
service and quickly made available. If a small service 
gets more successful, new resources can be added 
from the existing pool. This approach ensures two vi- 
tal principles: 

1. all services, even small ones, are using the au-
tomatic management and provisioning infra-
structure and
2. new services can be tested and deployed much
more rapidly.

Best practices for hardware selection include: 
Use only standard SKUs. Having a single or 
small number of SKUs in production allows re- 
sources to be moved fluidly between services 
as needed. The most cost-effective model is to 
develop a standard service-hosting framework 
that includes automatic management and provi- 
sioning, hardware, and a standard set of shared 
services. Standard SKUs are a core requirement
to achieve this goal. 

Purchase full racks. Purchase hardware in fully 
configured and tested racks or blocks of multi- 
ple racks. Racking and stacking costs are inex- 
plicably high in most data centers, so let the 
system manufacturers do it and wheel in full 
racks. 

Write to a hardware abstraction. Write the service
to an abstract hardware description. Rather than
fully exploiting the hardware SKU, the service
should neither exploit that SKU nor depend up-
on detailed knowledge of it. This allows the
2-way, 4-disk SKU to be upgraded over time as
better cost/performing systems become available.
The SKU should be a virtual description that in-
cludes the number of CPUs and disks, and a mini-
mum for memory. Finer-grained information
about the SKU should not be exploited.

Abstract the network and naming. Abstract the
network and naming as far as possible, using
DNS and CNAMEs. Always, always use a
CNAME. Hardware breaks, comes off lease,
and gets repurposed. Never rely on a machine
name in any part of the code. A flip of the
CNAME in DNS is a lot easier than changing
configuration files, or worse yet, production
code. If you need to avoid flushing the DNS
cache, remember to set Time To Live suffi-
ciently low to ensure that changes are pushed as
quickly as needed.
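
One way to keep machine names out of the code, as recommended above, is to have the code know only a service alias and resolve it each time it connects. The sketch below is an assumption-laden illustration: the alias orders-db.example.internal and the function resolve_service are hypothetical, and the resolver is injectable only so that the function can be exercised without a network.

```python
import socket

# Hypothetical alias; in DNS this would be a CNAME pointing at whichever
# machine currently hosts the service.
SERVICE_ALIAS = "orders-db.example.internal"


def resolve_service(alias, port, resolver=socket.getaddrinfo):
    """Return (ip, port) pairs for the alias, resolved at call time."""
    infos = resolver(alias, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return [info[4][:2] for info in infos]
```

Resolving at connect time, rather than once at startup, is what lets a CNAME flip take effect within the record's Time To Live.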


Operations and Capacity Planning 


The key to operating services efficiently is to 
build the system to eliminate the vast majority of oper- 
ations administrative interactions. The goal should be 
that a highly-reliable, 24x7 service should be main- 
tained by a small 8x5 operations staff. 


However, unusual failures will happen and there 
will be times when systems or groups of systems can’t 
be brought back on line. Understanding this possibili- 
ty, automate the procedure to move state off the dam- 
aged systems. Relying on operations to update SQL 
tables by hand or to move data using ad hoc tech- 
niques is courting disaster. Mistakes get made in the 
heat of battle. Anticipate the corrective actions the op- 
erations team will need to make, and write and test 
these procedures up-front. Generally, the development 
team needs to automate emergency recovery actions 
and they must test them. Clearly not all failures can be 
anticipated, but typically a small set of recovery ac- 
tions can be used to recover from broad classes of fail- 
ures. Essentially, build and test “recovery kernels” 
that can be used and combined in different ways de- 
pending upon the scope and the nature of the disaster. 


The recovery scripts need to be tested in produc- 
tion. The general rule is that nothing works if it isn’t 
tested frequently so don’t implement anything the 
team doesn’t have the courage to use. If testing in pro- 
duction is too risky, the script isn’t ready or safe for 
use in an emergency. The key point here is that disas- 
ters happen and it’s amazing how frequently a small 
disaster becomes a big disaster as a consequence of a 
recovery step that doesn’t work as expected. Antici- 
pate these events and engineer automated actions to 
get the service back on line without further loss of 
data or up time. 

Make the development team responsible. Amazon
is perhaps the most aggressive down this path
with their slogan “you built it, you manage it.”
That position is perhaps slightly stronger than the 
one we would take, but it’s clearly the right gen- 
eral direction. If development is frequently called 
in the middle of the night, automation is the like- 
ly outcome. If operations is frequently called, the 
usual reaction is to grow the operations team. 
Soft delete only. Never delete anything. Just mark
it deleted. When new data comes in, record the
requests on the way. Keep a rolling two week (or
more) history of all changes to help recover from
software or administrative errors. If someone
makes a mistake and forgets the where clause on
a delete statement (it has happened before and it
will again), all logical copies of the data are
deleted. Neither RAID nor mirroring can protect
against this form of error. The ability to recover
the data can make the difference between a high-
ly embarrassing issue and a minor, barely notice-
able glitch. For those systems already doing off-
line backups, this additional record of data com-
ing into the service only needs to go back to the
last backup. But, being cautious, we recommend
going farther back anyway.
Track resource allocation. Understand the costs of
additional load for capacity planning. Every ser- 
vice needs to develop some metrics of use such 
as concurrent users online, user requests per sec- 
ond, or something else appropriate. Whatever the 
metric, there must be a direct and known correla- 
tion between this measure of load and the hard- 
ware resources needed. The estimated load num- 
ber should be fed by the sales and marketing 
teams and used by the operations team in capaci- 
ty planning. Different services will have different 
change velocities and require different ordering 
cycles. We’ve worked on services where we up- 
dated the marketing forecasts every 90 days, and 
updated the capacity plan and ordered equipment 
every 30 days. 
Make one change at a time. When in trouble, on- 
ly apply one change to the environment at a 
time. This may seem obvious, but we’ve seen 
many occasions when multiple changes meant 
cause and effect could not be correlated. 
Make everything configurable. Anything that has 
any chance of needing to be changed in produc- 
tion should be made configurable and tunable in 
production without a code change. Even if there 
is no good reason why a value will need to 
change in production, make it changeable as 
long as it is easy to do. These knobs shouldn’t 
be changed at will in production, and the system 
should be thoroughly tested using the configura- 
tion that is planned for production. But when a 
production problem arises, it is always easier, 
safer, and much faster to make a simple config- 
uration change compared to coding, compiling, 
testing, and deploying code changes. 
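
The "soft delete only" practice from the list above can be sketched with an in-memory SQLite database. The messages table, its deleted_at column, and the helper names are assumptions made for illustration; they are not taken from any particular service.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    " id INTEGER PRIMARY KEY,"
    " body TEXT NOT NULL,"
    " deleted_at TEXT)"  # NULL means the row is live
)
conn.executemany("INSERT INTO messages (body) VALUES (?)",
                 [("first",), ("second",)])


def soft_delete(conn, message_id):
    """Mark a row deleted instead of removing it."""
    conn.execute(
        "UPDATE messages SET deleted_at = datetime('now') "
        "WHERE id = ? AND deleted_at IS NULL", (message_id,))


def undelete(conn, message_id):
    """Recover from a mistaken delete by clearing the marker."""
    conn.execute("UPDATE messages SET deleted_at = NULL WHERE id = ?",
                 (message_id,))


def live_bodies(conn):
    """Return the bodies of rows that are not marked deleted."""
    rows = conn.execute(
        "SELECT body FROM messages WHERE deleted_at IS NULL ORDER BY id")
    return [body for (body,) in rows]
```

Because the row is never physically removed, a mistaken delete is undone with a single update rather than a restore from backup.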
Auditing, Monitoring and Alerting


The operations team can’t instrument a service in 
deployment. Make substantial effort during develop- 
ment to ensure that performance data, health data, 
throughput data, etc. are all produced by every compo- 
nent in the system. 


Any time there is a configuration change, the ex- 
act change, who did it, and when it was done needs to 
be logged in the audit log. When production problems 
begin, the first question to answer is what changes 
have been made recently. Without a configuration au- 
dit trail, the answer is always “nothing” has changed 
and it’s almost always the case that what was forgotten 
was the change that led to the question. 
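
A configuration audit trail of the kind described above needs little more than an append-only record of what changed, who changed it, and when. In this sketch the function names, the field names, and the use of a list of JSON lines as the log are all assumptions; a production system would write to durable, tamper-resistant storage.

```python
import json
import time


def record_config_change(audit_log, user, key, old, new, now=None):
    """Append one audit entry: the exact change, who made it, and when."""
    entry = {
        "ts": time.time() if now is None else now,
        "user": user,
        "key": key,
        "old": old,
        "new": new,
    }
    audit_log.append(json.dumps(entry, sort_keys=True))
    return entry


def changes_since(audit_log, since_ts):
    """Answer "what changed recently?" from the audit trail."""
    entries = (json.loads(line) for line in audit_log)
    return [e for e in entries if e["ts"] >= since_ts]
```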


Alerting is an art. There is a tendency to alert on 
any event that the developer expects they might find 
interesting and so version-one services often produce 
reams of useless alerts which never get looked at. To 
be effective, each alert has to represent a problem. 
Otherwise, the operations team will learn to ignore 
them. We don’t know of any magic to get alerting cor- 
rect other than to interactively tune what conditions 
drive alerts to ensure that all critical events are alerted 
and there are not alerts when nothing needs to be 
done. To get alerting levels correct, two metrics can 
help and are worth tracking: 

1. alerts-to-trouble ticket ratio (with a goal of near 
one), and 

2. number of systems health issues without corre- 
sponding alerts (with a goal of near zero). 
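
The two metrics above are simple to compute once alerts, trouble tickets, and health issues are counted over the same period. The function below is a hypothetical sketch; in particular, the "alerted" flag on each health issue is an assumed field.

```python
def alert_quality(alerts, tickets, health_issues):
    """Compute the two alert-tuning metrics.

    alerts, tickets: counts over the same period.
    health_issues: list of dicts with a boolean "alerted" field.
    Returns (alerts_to_ticket_ratio, issues_without_alerts).
    """
    ratio = alerts / tickets if tickets else float("inf")
    missed = sum(1 for issue in health_issues if not issue["alerted"])
    return ratio, missed
```

A ratio well above one suggests noisy alerts; a non-zero second value means some failures are going unreported.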


Best practices include: 
Instrument everything. Measure every customer
interaction or transaction that flows through the 
system and report anomalies. There is a place 
for “runners” (synthetic workloads that simu- 
late user interactions with a service in produc- 
tion) but they aren’t close to sufficient. Using 
runners alone, we’ve seen it take days to even 
notice a serious problem, since the standard 
runner workload was continuing to be pro- 
cessed well, and then days more to know why. 
Data is the most valuable asset. If the normal
operating behavior isn’t well-understood, it’s 
hard to respond to what isn’t. Lots of data on 
what is happening in the system needs to be 
gathered to know it really is working well. 
Many services have gone through catastrophic 
failures and only learned of the failure when the 
phones started ringing. 
Have a customer view of service. Perform end-to- 
end testing. Runners are not enough, but they are 
needed to ensure the service is fully working. 
Make sure complex and important paths such as 
logging in a new user are tested by the runners. 
Avoid false positives. If a runner failure isn’t con- 
sidered important, change the test to one that is. 
Again, once people become accustomed to ignor-
ing data, breakages won’t get immediate attention.
Instrument for production testing. In order to
safely test in production, complete monitoring 
and alerting is needed. If a component is fail- 
ing, it needs to be detected quickly. 

Latencies are the toughest problem. Examples
are slow I/O and components that are not quite
failing but are processing slowly. These are hard
to find, so instrument carefully to ensure they
are detected.

Have sufficient production data. In order to find 
problems, data has to be available. Build fine- 
grained monitoring in early or it becomes ex- 
pensive to retrofit later. The most important 
data that we’ve relied upon includes: 

o Use performance counters for all opera- 
tions. Record the latency of operations and 
number of operations per second at the 
least. The waxing and waning of these val- 
ues is a huge red flag. 

o Audit all operations. Every time somebody
does something, especially something sig- 
nificant, log it. This serves two purposes: 
first, the logs can be mined to find out 
what sort of things users are doing (in our 
case, the kind of queries they are doing) 
and second, it helps in debugging a prob- 
lem once it is found. 

A related point: this won’t do much good if
everyone is using the same account to ad-
minister the systems. This is a very bad idea,
but not all that rare.

o Track all fault tolerance mechanisms. Fault
tolerance mechanisms hide failures. Track 
every time a retry happens, or a piece of 
data is copied from one place to another, or 
a machine is rebooted or a service restart- 
ed. Know when fault tolerance is hiding 
little failures so they can be tracked down 
before they become big failures. We had a 
2000-machine service fall slowly to only 
400 available over the period of a few days 
without it being noticed initially. 

o Track operations against important entities.
Make an “audit log” of everything significant
that has happened to a particular entity, be it
a document or chunk of documents. When
running data analysis, it’s
common to find anomalies in the data. 
Know where the data came from and what 
processing it’s been through. This is partic- 
ularly difficult to add later in the project. 
o Asserts. Use asserts freely and throughout
the product. Collect the resulting logs or 
crash dumps and investigate them. For 
systems that run different services in the 
same process boundary and can’t use as- 
serts, write trace records. Whatever the im- 
plementation, be able to flag problems and 
mine frequency of different problems. 




o Keep historical data. Historical performance 
and log data is necessary for trending and 
problem diagnosis. 


Configurable logging. Support configurable log-
ging that can optionally be turned on or off as
needed to debug issues. Having to deploy new
builds with extra monitoring during a failure is
very dangerous.


Expose health information for monitoring. Think
about ways to externally monitor the health of
the service and make it easy to monitor it in
production.


Make all reported errors actionable. Problems
will happen. Things will break. If an unrecover-
able error in code is detected and logged or re-
ported as an error, the error message should in-
dicate possible causes for the error and suggest
ways to correct it. Un-actionable error reports
are not useful and, over time, they get ignored
and real failures will be missed.


Enable quick diagnosis of production problems.

o Give enough information to diagnose. When
problems are flagged, give enough informa-
tion that a person can diagnose them. Otherwise
the barrier to entry will be too high and the
flags will be ignored. For example, don’t just
say “10 queries returned no results.” Add
“and here is the list, and the times they hap-
pened.”

o Chain of evidence. Make sure that from be-
ginning to end there is a path for a developer
to diagnose a problem. This is typically
done with logs.

o Debugging in production. We prefer a model
where the systems are almost never touched 
by anyone including operations and that de- 
bugging is done by snapping the image, 
dumping the memory, and shipping it out of 
production. When production debugging is 
the only option, developers are the best 
choice. Ensure they are well trained in what 
is allowed on production servers. Our expe- 
rience has been that the less frequently sys- 
tems are touched in production, the happier 
customers generally are. So we recommend 
working very hard on not having to touch 
live systems still in production. 

o Record all significant actions. Every time the
system does something important, particu- 
larly on a network request or modification 
of data, log what happened. This includes 
both when a user sends a command and 
what the system internally does. Having this 
record helps immensely in debugging prob- 
lems. Even more importantly, mining tools 
can be built that find useful aggregates,
such as what kinds of queries users are doing
(e.g., which words, how many words, etc.).
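
The performance-counter practice from the list above (record latency and operations per second for every operation) could look like the following. OpCounters and its methods are hypothetical names, and the injectable clock is an assumption made so the class can be tested deterministically.

```python
import time
from collections import defaultdict


class OpCounters:
    """Per-operation call counts and cumulative latency."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.count = defaultdict(int)
        self.total_latency = defaultdict(float)

    def record(self, op, latency_s):
        """Record one completed operation and how long it took."""
        self.count[op] += 1
        self.total_latency[op] += latency_s

    def snapshot(self):
        """Return {op: (ops_per_second, mean_latency_s)} since creation."""
        elapsed = max(self._clock() - self._start, 1e-9)
        return {
            op: (n / elapsed, self.total_latency[op] / n)
            for op, n in self.count.items()
        }
```

Taking snapshots at regular intervals and keeping them gives the historical series needed to see these values rise and fall.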
Graceful Degradation and Admission Control


There will be times when DOS attacks or some 
change in usage patterns causes a sudden workload 
spike. The service needs to be able to degrade gracefully
and control admissions. For example, during 9/11 
most news services melted down and couldn’t provide 
a usable service to any of the user base. Reliably de- 
livering a subset of the articles would have been a bet- 
ter choice. Two best practices, a “big red switch” and 
admission control, need to be tailored to each service. 
But both are powerful and necessary. 
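
As a rough illustration of the two practices just named, the toy class below sheds non-critical requests while a switch is set and refuses new work once its queue is full. The class, its fields, and the notion of a per-request critical flag are assumptions for the sketch; each real service has to decide for itself what counts as non-critical.

```python
class Service:
    """Toy service with a reversible load-shedding switch and a queue cap."""

    def __init__(self, max_queue=100):
        self.max_queue = max_queue
        self.queue = []
        self.big_red_switch = False  # True: shed non-critical work

    def admit(self, request, critical):
        """Return True if the request was accepted for processing."""
        if self.big_red_switch and not critical:
            return False  # shed or defer non-critical load
        if len(self.queue) >= self.max_queue:
            return False  # over capacity: let work queue at the source
        self.queue.append(request)
        return True
```

Resetting big_red_switch to False restores normal admission, which is the reversibility the switch is supposed to have.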

Support a “big red switch.” The idea of the
“big red switch” originally came from Windows
Live Search and it has a lot of power. We’ve
generalized it somewhat in that more transac-
tional services differ from Search in significant
ways. But the idea is very powerful and applica-
ble anywhere. Generally, a “big red switch” is a
designed and tested action that can be taken
when the service is no longer able to meet its
SLA, or when that is imminent. Arguably refer-
ring to graceful degradation as a “big red
switch” is a slightly confusing nomenclature
but what is meant is the ability to shed non-crit-
ical load in an emergency.

The concept of a big red switch is to keep the vi-
tal processing progressing while shedding or de-
laying some non-critical workload. By design,
this should never happen, but it’s good to have 
recourse when it does. Trying to figure these out 
when the service is on fire is risky. If there is 
some load that can be queued and processed lat- 
er, it’s a candidate for a big red switch. If it’s 
possible to continue to operate the transaction 
system while disabling advanced querying, that’s
also a good candidate. The key thing is deter- 
mining what is minimally required if the system 
is in trouble, and implementing and testing the 
option to shut off the non-essential services 
when that happens. Note that a correct big red 
switch is reversible. Resetting the switch should 
be tested to ensure that the full service returns to 
operation, including all batch jobs and other pre- 
viously halted non-critical work. 

Control admission. The second important con-
cept is admission control. If the current load
cannot be processed on the system, bringing
more work load into the system just assures that
a larger cross section of the user base is going
to get a bad experience. How this gets done is
dependent on the system and some can do this
more easily than others. As an example, the last
service we led processed email. If the system was
over-capacity and starting to queue, we were bet-
ter off not accepting more mail into the system
and letting it queue at the source. The key reason
this made sense, and actually decreased overall ser-
vice latency, is that as our queues built, we
processed more slowly. If we didn’t allow the
queues to build, throughput would be higher. An-
other technique is to service premium customers
ahead of non-premium customers, or known
users ahead of guests, or guests ahead of users
if “try and buy” is part of the business model.

Meter admission. Another incredibly impor-
tant concept is a modification of the admission
control point made above. If the system fails 
and goes down, be able to bring it back up 
slowly ensuring that all is well. It must be pos- 
sible to let just one user in, then let in 10 
users/second, and slowly ramp up. It’s vital that 
each service have a fine-grained knob to slowly 
ramp up usage when coming back on line or re- 
covering from a catastrophic failure. This capa- 
bility is rarely included in the first release of 
any service. 
Where a service has clients, there must be a 
means for the service to inform the client that 
it’s down and when it might be up. This allows 
the client to continue to operate on local data if 
applicable, and getting the client to back-off 
and not pound the service can make it easier to 
get the service back on line. This also gives an 
opportunity for the service owners to communi-
cate directly with the user (see below) and con-
trol their expectations. Another client-side trick
that can be used to prevent clients from all syn-
chronously hammering the server is to intro-
duce intentional jitter and per-entity automatic
backoff.
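
The client-side jitter mentioned above is commonly combined with capped exponential backoff. The generator below is a sketch of that general technique, not a method described by the author; the function name, default values, and injectable random source are all assumptions.

```python
import random


def backoff_delays(base_s=1.0, cap_s=60.0, attempts=6, rng=random.random):
    """Yield retry delays using capped exponential backoff with full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Pick uniformly in [0, ceiling) so clients desynchronize.
        yield rng() * ceiling
```

Because every client draws its own random delay, a fleet of clients that lost the service at the same instant will not all retry at the same instant.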


Customer and Press Communication Plan 


Systems fail, and there will be times when laten- 
cy or other issues must be communicated to cus- 
tomers. Communications should be made available 
through multiple channels on an opt-in basis: RSS,
web, instant messages, email, etc. For those services 
with clients, the ability for the service to communicate 
with the user through the client can be very useful. 
The client can be asked to back off until some specific 
time or for some duration. The client can be asked to 
run in disconnected, cached mode if supported. The 
client can show the user the system status and when 
full functionality is expected to be available again. 


Even without a client, if users interact with the 
system via web pages for example, the system state 
can still be communicated to them. If users understand 
what is happening and have a reasonable expectation 
of when the service will be restored, satisfaction is 
much higher. There is a natural tendency for service 
owners to want to hide system issues but, over time,
we’ve become convinced that making information on 
the state of the service available to the customer base 
almost always improves customer satisfaction. Even in 
no-charge systems, if people know what is happening 
and when it’ll be back, they appear less likely to aban-
don the service.


Certain types of events will bring press coverage. 
The service will be much better represented if these 
scenarios are prepared for in advance. Issues like mass 
data loss or corruption, security breach, privacy viola- 
tions, and lengthy service down-times can draw the 
press. Have a communications plan in place. Know 
who to call when and how to direct calls. The skeleton 
of the communications plan should already be drawn 
up. Each type of disaster should have a plan in place 
on who to call, when to call them, and how to handle 
communications. 


Customer Self-Provisioning and Self-Help 


Customer self-provisioning substantially reduces 
costs and also increases customer satisfaction. If a cus- 
tomer can go to the web, enter the needed data and just 
start using the service, they are happier than if they had 
to waste time in a call processing queue. We’ve always 
felt that the major cell phone carriers miss an opportu-
nity to both save money and improve customer satisfaction by
not allowing self-service for those that don’t want to 
call the customer support group. 


Conclusion 


Reducing operations costs and improving service 
reliability for a high scale internet service starts with 
writing the service to be operations-friendly. In this 
document we define operations-friendly and summa- 
rize best practices in service design, development, de- 
ployment, and operation from engineers working on 
high-scale services.


ABSTRACT


We propose RepuScore, a collaborative reputation management framework over email
infrastructure, which allows participating organizations to establish sender accountability on the basis of
senders’ past actions. RepuScore’s generalized design can be deployed with any Sender Authenti- 
cation technique such as SPF, SenderID and DKIM. With RepuScore, participating organizations 
collect information on sender reputation locally from users or existing spam classification mecha- 
nisms and submit it to a central RepuScore authority. The central authority generates a global rep- 
utation summary which can be used to enforce sender accountability. We present the algorithms 
for reputation score calculation and share our findings from experiments based on a RepuScore 
prototype using a) our simulation logs and b) a 20-day log from a non-profit organization with five
collaborating domains.


Introduction 


In an effort to prevent sender address spoofing 
and phishing attacks, about 35% of all emails over the 
Internet are authenticated using sender authentication 
systems [13] such as DKIM [1, 23], SPF [22] and 
SenderID [11]. These systems allow receivers to au- 
thenticate the sender’s mail server before email deliv- 
ery to a mailbox. 


Authentication schemes alone, however, do not 
provide the organization with the capability to differ- 
entiate between a credible sender and an unscrupulous 
one. Indeed, it has been noted that spammers have 
been the early adopters of these systems. This shows 
that a sender’s identity does not necessarily guarantee 
their trustworthiness because trusting a sender can on- 
ly be possible after verifying their past adherence to 
best mail practices. Currently, sender identification 
techniques are being used as the basis for determining 
the sender’s history of adherence to best mail practices 
[21]. A reputation management system deployed at a 
single organization [3] has demonstrated that the his- 
tory of the sender’s adherence can provide an effective 
email classification mechanism. 


We believe that organizations would benefit from 
sharing senders’ reputation information that is individ- 
ually collected at each domain. By collecting reputa- 
tion scores from multiple organizations, the email re- 
ceivers could access a complete history of a sender’s 
past actions. Such a global perspective of a sender’s 
reputation would allow receivers to trust a sender that 
they have no prior information about. As with receiver 
collaboration, a sender’s spamming activity would be 
reported to all receivers: the onus is on the senders not 
to transmit unsolicited emails to any reputation-shar- 
ing receiver. 


In this paper, we propose RepuScore, a reputa- 
tion management framework for email infrastructure 
that uses receiver collaboration to compile global rep- 
utation for a sender. RepuScore helps create and main- 
tain a trusted group of organizations. We discuss the 
deployment of RepuScore with sender authentication 
techniques. The design considerations for RepuScore 
are as follows: 


First, the RepuScore framework can be used to 
collect, compute and share reputation among organiza- 
tions. To keep track of the sender’s history of adher- 
ence, RepuScore takes into account the reputation of 
the sender in the previous time frame along with the 
spam rate in the present time frame. Towards this, Re- 
puScore employs the Time Sliding Window Exponen- 
tially Weighted Moving Average (TSW-EWMA) algo- 
rithm. 
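
The idea of combining the previous time frame's reputation with the present frame's spam rate can be illustrated with a plain exponentially weighted moving average. This is a generic EWMA sketch, not the TSW-EWMA algorithm itself, whose exact formula is not given in this passage; the function name, the weight alpha, and the choice of 1 - spam_rate as the per-frame score are all assumptions.

```python
def update_reputation(previous, spam_rate, alpha=0.5):
    """Blend last interval's reputation with this interval's behaviour.

    previous:  reputation in [0, 1] from the previous time frame.
    spam_rate: fraction of the sender's mail reported as spam in the
               present time frame, in [0, 1].
    alpha:     weight of the present frame (hypothetical value).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    current = 1.0 - spam_rate  # higher is better
    return alpha * current + (1.0 - alpha) * previous
```

With alpha = 0.5, a sender with a perfect record who starts sending only spam sees its score halve in each frame, so history decays gradually rather than being discarded at once.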

Second, RepuScore eases the overhead of reputa-
tion collection and computation with the help of a dis-
tributed architecture. Such an architecture allows each
organization to collect votes from its users. However,
distributing the reputation management creates addi- 
tional challenges. 


Since RepuScore employs a distributed reputa- 
tion framework, it is susceptible to Sybil attacks [20, 
26]. In Sybil attacks, a malicious receiver manipulates 
the rating mechanism by creating multiple identities to 
give a higher rating to emails sent from the colluding 
senders and a lower rating to legitimate senders. Sybil 
attacks are thwarted by valuing a reputable partici-
pant’s rating more highly than that of a less reputable
participant. RepuScore employs the Weighted Moving
Algorithm Continuous (WMC) [24] to thwart Sybil at-
tacks. RepuScore introduces a participant voting thresh-
old, a minimum threshold required by organizations to
participate in global reputation computation, to mitigate
Sybil attacks.
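
The two defences described above, weighting each rating by the reporter's own reputation and ignoring reporters below a participation threshold, can be illustrated with a simple weighted mean. This is a simplified stand-in, not the WMC algorithm of [24]; the function name, the data layout, and the threshold value are assumptions.

```python
def global_reputation(votes, reporter_reputation, min_reputation=0.5):
    """Weighted mean of votes about one sender.

    votes: {reporter: rating in [0, 1]}.
    reporter_reputation: {reporter: reputation in [0, 1]}.
    min_reputation: hypothetical participation threshold; reporters
        below it are ignored.
    Returns None if no reporter qualifies.
    """
    num = den = 0.0
    for reporter, rating in votes.items():
        weight = reporter_reputation.get(reporter, 0.0)
        if weight < min_reputation:
            continue
        num += weight * rating
        den += weight
    return num / den if den else None
```

Freshly created identities start with little or no reputation, so under this scheme a crowd of them cannot outvote a few established participants.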


Third, RepuScore supports a centralized reputa- 
tion scoring mechanism with minimal overhead. This 
centralized mechanism creates a trusted group of rep- 
utable senders. The lack of centralized enforcement 
has been cited as the main obstacle in tying email 
fraud to a particular user or organization [10]. 


The remainder of this paper is organized as fol- 
lows. In the following section, we discuss the related 
work. We then describe the design issues for a reputa- 
tion management framework and present the RepuS- 
core design in the next section followed by its proto- 
type implementation. We then discuss our results and 
conclude in the last sections. 


Related Work 


In this section, we discuss the reputation man- 
agement frameworks that have been designed for 
email infrastructure, followed by a discussion on 
sender identity systems. 


Reputation Systems for Email Infrastructure 


SenderPath’s Sender Score [19] and Habeas’ 
SenderIndex [8] provide reputation for a sender’s IP 
address. SecureComputing’s TrustedSource [4] pro- 
vides a global reputation system with the help of de- 
ployed mail servers in different organizations. Reputa- 
tion based on IP addresses is not effective, as an IP ad- 
dress cannot be bonded to a specific organization [5]. 
For instance, when multiple organizations share an IP 
address, spammers in a single domain can affect the 
reputation of users in other organizations. Moreover, if 
organizations move to another service provider, their 
past actions would no longer be attributed to them. We 
believe that a reputation should be more closely asso- 
ciated with the organization, possibly utilizing the do- 
main name of the organization. 


Project Lumos [9] was proposed as an effort to 
provide reputation among collaborating ISPs. The re- 
ceivers provided feedback as to whether a sender was 
a spammer or otherwise. Reputation was based on the 
activity of the previous 180 days. Project Lumos was 
designed to consider the weighted average of previous 
and present reputation of the senders. We believe that 
to thwart Sybil attacks and provide an open reputation 
management system, the reporter’s reputation should 
also be taken into account in order to provide an accu- 
rate summary of a sender’s reputation. 


Google’s reputation service [3] identifies the 
senders using best-guess SPF [22] or DKIM [1, 23] and 
computes the sender’s reputation based on the inputs 
from users. This system demonstrated a high accuracy 
in classifying Google’s emails. The paper also points 
out the need for a third party reputation framework. 


Certification Systems for Email Infrastructure 


Systems like SenderPath’s SenderScore Certified
[18], Habeas’ Safelist [7] and Goodmail’s Certified
Email [6] are certification and accreditation services.
These services allow bulk senders to obtain third party
certification to be able to send bulk emails. They are
not really reputation systems, as the sender maintains
the reputation and not the receivers.


Identity Based Email Classification 


Receivers can identify senders based on the 
sender’s email ID, IP address or domain. For instance, 
PGP [15, 27] is an email Id-based authentication tech- 
nique where a third-party server maintains individual 
users’ public keys. The receivers verify the senders’ 
signed emails by retrieving the sender’s public key. 


IP addresses are used to identify spammers in sys-
tems such as Blacklist IP and Real-time Blackhole List
(RBL) [17] that keep a list of IP addresses that propa- 
gate spam. Though several RBLs are available, a recent 
study has shown that only 50% of spam is correctly 
identified by combined use of two or more lists [16]. 
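
DNS-based lists of this kind are conventionally queried by reversing the octets of the IPv4 address and appending the list's zone; an answer to the resulting DNS query means the address is listed. The helper below only builds the query name. The function name is hypothetical and dnsbl.example.org is a placeholder zone, not a real list.

```python
import ipaddress


def dnsbl_query_name(ip, zone="dnsbl.example.org"):
    """Build the DNS name to look up for an IPv4 address in a blocklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"
```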


We believe that maintaining a group of high- 
spam-propagating domains is more difficult than main- 
taining a group of non-spam-propagating domains. 
Spamming senders usually do not exist for long peri- 
ods of time, whereas non-spamming senders usually 
exist for long periods of time. 


SenderID [11] verifies the IP address presented 
by the email against that of the sender’s registered 
mail servers. Using Sender Policy Framework (SPF) 
[22], a receiver thwarts sender forgery by identifying the 
sender’s mail server through DNS entries. DomainKeys
Identified Mail (DKIM) [1, 13] publishes the mail serv-
er’s public keys as a part of DNS records. Each email is
signed by the sender’s mail server. The signature is used 
by the receiver’s mail server to verify the sender. 


Accredited DomainKeys adds a central authority 
to DomainKeys architecture [12]. The centralized au- 
thority, called the Accreditation Bureau, maintains the 
sender domain’s public key. The users should conform 
to a specified usage policy and adherence to the policy 
is checked periodically. We suggest that a reputation 
based mechanism where the receivers can vote on 
whether the senders adhere to the specified usage poli- 
cy would help in enforcement of the usage policy. 


RepuScore Design 


In this section, we describe the design considerations
for a reputation framework. We note that
authentication techniques and a reputation framework
work together to create a trusted group of reputable
senders. A verified identity (through an existing
authentication mechanism) is a required basis for
maintaining a sender’s reputation. Moreover, a reputation
service is able to guide a receiver through the
process of validating the sender before the sender’s
emails are accepted.


Sender Identity Techniques 


Email ID-based identity systems, such as PGP, can be
used to maintain reputation. However, using email IDs
entails a huge overhead for vote collection,
storage and reputation computation. Instead of
using email IDs as identities for reputation management,
we use domain authentication schemes, thereby
decreasing the number of identities needed. We believe
that this approach is more scalable than an email
ID-based reputation system.


As mentioned, about 35% of all authenticated 
email over the Internet is authenticated using SPF, 
DKIM or SenderID [13]. A reputation management 
system can be built to help evaluate the senders who 
are being authenticated using these mechanisms. Such 
a mechanism will help evaluate the domains that ad- 
here to a common guideline. 


The lack of a centralized authority [10] has been 
noted as a main reason for the inability to tie email 
forgery to a single user or the organization. A central 
authority can maintain a trusted group of reputable 
senders where each sender needs to maintain a high 
reputation. Such a mechanism allows a common best 
email practice to be enforced among senders. 





Figure 1: Collection of reputation votes by different
RepuServers in an organization. The sender’s identity
could be used with any domain authentication technique.
The reputation votes can be submitted either by
users such as Bob or using any currently available
filtering mechanism. Each domain maintains a single
RepuCollector that collects the reputation votes from
the multiple RepuServers in the organization.


Design of RepuScore Framework 


A reputation management framework should on- 
ly accept a single reputation vote from each organiza- 
tion. A large global organization might have multiple 
mail servers, each situated in different geographic lo- 
cations, for example, in different countries. If a reputa- 
tion management framework considered votes from 
mail servers, an organization with a huge number of 
mail servers would have greater say than organizations 
with a single mail server. Hence, each organization 
should be given a single vote that should be the aggre- 
gate of all the mail servers in the domain. 


We define a RepuServer as a mail server with the
capability of verifying the users and collecting the
reputation votes from them. Each local RepuServer at
a domain collects votes from its users and email filters,
aggregates the votes locally, and forwards them
to the RepuCollector of the domain. We define a
RepuCollector as an organizational-level service that
aggregates the votes from the local RepuServers and
participates in a global reputation of peer RepuCollectors.


Each RepuServer records the total number of 
emails received during a given period and counts the 
ones that are considered to be spam by the currently 
available spam-filtering mechanism or by the user’s 
input. Figure 1 demonstrates the mechanism for the re- 
ceivers to report spam to their local RepuServer. Such 
reports can also be performed by currently available 
email classification techniques without user involve- 
ment. In the event that the report is conflicting, a us- 
er’s input can be used to increase the reputation of the 
sender. 


The RepuCollector’s reputation should decrease 
for bad behavior and increase in the absence of bad be- 
havior. For example, if spam is reported, the sender’s 
reputation should decrease. If no spam is reported, the 
reputation should increase. 


An ideal initial reputation is a requirement for
building the reputation of a new RepuCollector or
RepuServer. An overly high initial reputation would give
high-spam-propagating domains an unfair advantage,
as their reputation would stay high for a long time. In
contrast, a low initial reputation would be unfair to a
new domain, as its emails would not be accepted by
peers.


Since the sender’s reputation changes over time 
and is computed after receiver collaboration, the repu- 
tation is computed in every time period. We define this 
period as the Reputation Aggregation Interval. A Re- 
puCollector should invest a significant number of rep- 
utation aggregation intervals to be considered a good 
sender. Such a mechanism would make spamming un- 
viable for a spammer as it would require a significant 
investment of resources, including both time and mon- 
ey. In addition, a quick reduction in reputation for 
non-adherence to the policy removes spammers from 
the trusted group of senders. 


Finally, the reputation framework should guard 
against Sybil attacks [20, 26] where users with multi- 
ple identities attempt to change the reputation of the 
senders [24]. We believe that domains which transfer 
high amounts of spam would attempt to unfairly in- 
crease their reputations in order to be considered part 
of the trusted group of senders. To thwart such attacks, 
RepuCollectors having a reputation lower than a given 
threshold, which we refer to as the Participation 
Threshold, would not be allowed to participate in the 
voting mechanism. 


RepuScore Prototype Implementation 


Our framework, RepuScore, is a generic design
that can be employed by sender identity systems. As
discussed in the related work section, RepuScore creates
a trusted group of reputable senders. We believe
that a reputation framework should facilitate creation
of such a group rather than just maintaining a group of
blacklisted senders. In this section, we describe a
generalized architecture that can be used with RepuScore.
We generalize the architecture so that any sender
authentication scheme can be used along with RepuScore.
The architecture describes vote collection,
reputation computation and also centralized reputation
computation. Finally, we discuss the RepuScore
algorithm used for reputation computation.


RepuScore differs from other approaches because
it computes a collaborative reputation based on the
scores suggested by peers. As RepuScore is designed
as a collaborative mechanism, it has been designed to
protect against Sybil attacks, where a single attacker
can take multiple identities. To this end, RepuScore
takes into account the reputation of the reporting server
along with the reputation it reports for a peer.


RepuScore Architecture 


RepuScore’s hierarchical architecture is designed
so that the reputation collection and computation is
manageable as the number of participating domains
increases. The RepuScore framework computes reputation
based on the votes collected by each RepuServer.
While collecting reputation votes, a RepuServer checks
the validity of the reporting users. The users’ votes are
based on their evaluation of the sender’s adherence to
best practices. We outline three major steps in
RepuScore’s architecture:

a) Reputation Vote Collection 
b) Reputation Computation 
c) Reputation Lookup Service 


Reputation Vote Collection 


As the definition of spam is subjective, an email 
regarded as spam by one user might not be considered 
so by another. Therefore, a global blacklist or white 
list would not be ideal as it would fail to represent the 
conflicting views of multiple users. RepuScore em- 
ploys a social rating mechanism to consider the con- 
flicting views of the users. 


The receiver’s RepuServer can maintain the number
of emails received and the emails marked as spam
for each sender RepuServer. The vote collection mechanism
should require minimal participation from the
users. For example, RepuServer collects the users’
votes based on the users’ implicit inputs. Users only
need to mark an incorrectly filtered email as non-spam
or to report a spam email that was not correctly filtered
by the spam classifiers. (Many email services
provide similar mechanisms for their users to report a
spam email or an incorrectly filtered email.) Figure 1
demonstrates the mechanism by which the RepuServer
collects the votes from multiple users. Before accepting
votes from the users, the RepuServer should validate
the users.


The spam classifiers are also used along with
users’ input in collecting votes. A negative vote for a
sender is entered when the spam filter classifies an
email as spam. Likewise, a positive vote for a sender
is automatically made when the sender’s email is not
considered spam. In the event that the spam filter
marks a legitimate email as spam, the users can mark
the email as non-spam, submitting a positive vote for the
sender to the RepuServer.
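

The vote-collection rules above can be sketched in Python. This is a minimal
illustration, not the RepuScore implementation: the names SenderTally and
record_email and the host name are hypothetical, and the sketch assumes that a
user's explicit verdict, when present, overrides the spam filter's verdict.

```python
from dataclasses import dataclass


@dataclass
class SenderTally:
    """Per-sender counters kept by a receiving RepuServer for one interval."""
    total: int = 0
    spam: int = 0


def record_email(tallies, sender, filter_says_spam, user_says_spam=None):
    """Record one email; a user's verdict, when given, overrides the filter."""
    is_spam = filter_says_spam if user_says_spam is None else user_says_spam
    tally = tallies.setdefault(sender, SenderTally())
    tally.total += 1
    if is_spam:
        tally.spam += 1
    return tally


tallies = {}
# Two emails judged only by the filter: one clean, one spam.
record_email(tallies, "mail.example.org", filter_says_spam=False)
record_email(tallies, "mail.example.org", filter_says_spam=True)
# The filter flagged this one, but the user marked it as non-spam.
record_email(tallies, "mail.example.org", filter_says_spam=True,
             user_says_spam=False)
print(tallies["mail.example.org"])
```

Only the two counters need to be kept per sender, which is all the reputation
computation in the following sections consumes.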


Reputation Computation 


Based on the number of spam and emails collect- 
ed, each RepuServer calculates the reputation of the 
sender RepuServer. RepuServer Reputation is defined 
as the weighted average of its reputation in the previ- 
ous reputation aggregation interval and the reputation 
computed in the present reputation aggregation inter- 
val. 


RepuScore calculates the reputation of a Repu- 
Collector based on the reputation of the RepuServers 
maintained by the RepuCollector. We define the Repu- 
Collector Reputation as the aggregate reputation of the 
RepuServers in their domain in the present reputation 
aggregation interval. 


Each RepuCollector calculates the local reputation 
for each peer RepuCollector. The computed reputation 
is digitally signed by each RepuCollector to maintain 
the integrity of the data. To provide a global perspec- 
tive, the locally computed RepuCollector’s reputations 
should be collected by the Central Authority. 


RepuScore introduces a central authority that col- 
lects reputation votes from all the RepuCollectors and 
computes the global reputation for all RepuCollectors. 
The central authority verifies the RepuCollector’s votes 
based on the digital signature. The central authority 
should make sure that the reputation collection is con- 
ducted once every reputation aggregation interval. The 
central authority calculates a global reputation for each 
RepuCollector based on the change in its reputation as 
reported by peer RepuCollectors. The central authority 
takes into account the reputation of the RepuCollectors 
to compute the global reputation of the peer RepuCollectors.
If a reporting RepuCollector’s reputation is below
the participation threshold, its reputation votes
are not factored into the global reputation.


Reputation Lookup Service 


A reputation lookup service can be provided
with the help of a third-party lookup service. The reputation
lookup service can be similar to Real-time Blackhole
Lists. Such a reputation lookup service can also provide
a mechanism for the receivers to look up the reputation of
a sender’s RepuCollector as reported by peers.


An alternate way for receivers to determine reputation
is by associating the reputation with a sender
identity that can be vouched for by a third party. For
example, in the case of Accredited DomainKeys, the
reputation can be embedded as part of the seal that is
supplied to the MTAs. When the client checks the DNS
entries, the seal can be verified for the reputation.


RepuScore Algorithm 


In this section, we discuss RepuScore’s algo- 
rithm. The RepuCollector’s reputation is calculated 
based on the reputation of all the RepuServers it main- 
tains. We first demonstrate how each RepuCollector 
calculates the reputation of peer RepuServers. We then 
discuss the reputation computation of peer RepuCol- 
lectors. Finally, we demonstrate how a global reputa- 
tion is calculated. 


With the help of reputation, administrators in an
organization can evaluate the compliance of the domain
by checking their organization’s reputation services.
If the domain’s reputation is lower than expected,
that would imply that there might be bots on the
server [14].


RepuServer Reputation Calculation 


As discussed in the above sections, a RepuServer’s
reputation is calculated by peer RepuServers. The
reputation in RepuScore is always in the closed interval
[0, 1]. A score of 1 indicates a highly reputable sender
whereas a score of 0 indicates a sender with a low
reputation. For all sender RepuServers, each receiving
RepuServer maintains the number of emails received
and the number among those marked as spam. The
reputation of a RepuServer is computed as the number
of good emails over the number of emails sent by a
RepuServer in a particular interval. The reputation is
calculated based on the modified time sliding window
exponentially weighted moving average (TSW-EWMA)
algorithm [2].

Equation 1 displays the weighted moving average.
The RepuServer reputation is based on the reputation in
the previous interval and the reputation in the present
interval. The correlation factor α indicates the amount of
previous reputation considered for computation of the
RepuServer’s reputation in the new interval. If the
correlation factor is high, the reputation of a sender takes a
long time to increase or decrease, as a lot of weight is
given to the previous reputation. However, if the
correlation factor is low, the reputation increases or decreases
very quickly since current actions are given additional
weight. We demonstrate the effect of the correlation
factor on reputation in our experiments.
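

As a minimal sketch of the weighted moving average of Equation 1, the update
can be written in a few lines of Python. The function names new_rep and
update_rep are assumptions made for illustration. How an interval with no email
should be treated is not specified in the text, so the sketch simply rejects it.

```python
def new_rep(n_spam, n_total):
    """Reputation earned in the current interval: the fraction of non-spam email."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 1.0 - n_spam / n_total


def update_rep(prev_rep, n_spam, n_total, alpha):
    """Weighted moving average of the previous and the current reputation."""
    return alpha * prev_rep + (1.0 - alpha) * new_rep(n_spam, n_total)


# A sender that starts at 0.5 and keeps a 10% spam rate, for a high and a
# low correlation factor.
for alpha in (0.9, 0.1):
    rep = 0.5
    for _ in range(5):
        rep = update_rep(rep, n_spam=10, n_total=100, alpha=alpha)
    print(f"alpha={alpha}: reputation after 5 intervals = {rep:.3f}")
```

With a constant spam rate r, the fixed point of the update is 1 - r for any
alpha below 1, so alpha only controls how quickly that value is approached.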


RepuCollector Reputation Calculation 


Based on the change in its RepuServers’ reputations,
the RepuCollector’s reputation can be updated.
Equation 2 shows the local reputation computation of
a RepuCollector. The local RepuCollector reputation
is the average reputation of all of its RepuServers.
Each RepuCollector transmits the reputation of the peer
RepuCollectors to the central authority. As discussed
in the above sections, the central authority considers
the votes only from the RepuCollectors whose reputation
is greater than the participation threshold. We
demonstrate that such a mechanism helps in the creation
of a trusted group of reputable senders.


Given:
• $E_m$ is the set of emails received by a RepuServer
  in reputation aggregation interval $m$.
• $S_i$ is the RepuServer $i$.
For all $S_i \in E_m$:
• $\mathit{NewRep}(S_i, m) = 1 - \frac{n(\mathit{spam}_m)}{n(\mathit{TotalEmails}_m)}$
• $\mathit{Rep}(S_i, m) = \alpha \times \mathit{Rep}(S_i, m-1) + (1 - \alpha) \times \mathit{NewRep}(S_i, m)$
Where:
• $\mathit{Rep}(S_i, m)$ is the sender RepuServer’s reputation
  in the interval $m$.
• $n(\mathit{spam}_m)$ is the number of spam emails received in the
  interval $m$.
• $n(\mathit{TotalEmails}_m)$ is the total number of emails
  received in the interval $m$.
• $\alpha$ ($0 < \alpha < 1$) is the correlation factor between
  the previous and the present value.
Equation 1: Local RepuServer reputation.


Given:
• The RepuCollector reputation is the average of the
  reputations reported to a RepuCollector by its
  RepuServers.
For all RepuCollectors:
$$\mathit{LocalCollectorRep}_{c,o}(m) = \frac{1}{n(S_o)} \sum_{i=0}^{n(S_o)} \mathit{Rep}(S_i, m)$$
Where:
• $\mathit{LocalCollectorRep}_{c,o}$ is the local reputation of
  RepuCollector $c$ reported by RepuServer $o$ in
  the organization.
• $n(S_o)$ is the number of sender RepuServers seen
  by RepuServer $o$.
Equation 2: Local RepuCollector reputation.


Given:
• The RepuCollector reputation is the weighted moving
  average continuous of the local reputation computed
  in the $m$th interval.
For all RepuCollectors, the central authority calculates:
$$\mathit{CollectorRep}_i(m) = \frac{\sum_{n=0} \mathit{CollectorRep}_n(m-1) \times \mathit{LocalCollectorRep}_{i,n}(m)}{\sum_{n=0} \mathit{CollectorRep}_n(m-1)}$$
Where:
• $\mathit{CollectorRep}_i(m)$ is the global reputation of
  RepuCollector $i$.
• $n$ ranges over the peer RepuCollectors that report on
  RepuCollector $i$.
Equation 3: Global RepuCollector reputation.
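

The local average of Equation 2 can be sketched as follows. This is an
illustration under stated assumptions: the function name local_collector_reps,
the server_to_collector mapping and the host names are hypothetical, and the
sketch takes the view of a single receiving RepuServer.

```python
from collections import defaultdict


def local_collector_reps(server_reps, server_to_collector):
    """Average sender RepuServer reputations per RepuCollector.

    server_reps:         {sender RepuServer: its reputation in this interval}
    server_to_collector: {sender RepuServer: the RepuCollector it belongs to}
    """
    grouped = defaultdict(list)
    for server, rep in server_reps.items():
        grouped[server_to_collector[server]].append(rep)
    # One value per organization, however many mail servers it runs.
    return {collector: sum(reps) / len(reps)
            for collector, reps in grouped.items()}


server_reps = {"mx1.a.example": 0.9, "mx2.a.example": 0.7, "mx1.b.example": 0.4}
server_to_collector = {
    "mx1.a.example": "a.example",
    "mx2.a.example": "a.example",
    "mx1.b.example": "b.example",
}
local = local_collector_reps(server_reps, server_to_collector)
print(local)
```

Averaging per organization is what keeps a domain with many mail servers from
carrying more weight than a domain with a single one.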


The central authority calculates the global reputation
of the RepuCollectors based on a modified
Weighted Majority Algorithm (WMA) called WMA
Continuous (WMC) proposed by Yu et al. [24]. The
WMC algorithm has been used in peer-to-peer systems
to detect deception. We provide the participation
threshold as a mechanism to remove domains that
propagate spam and increase the reputations of other
senders.


Equation 3 demonstrates the Global RepuCollec- 
tor reputation as the reputation-weighted average of 
the local RepuCollector reputation computed by each 
peer. The new reputation is computed once every rep- 
utation aggregation interval and is valid for one Ag- 
gregation Interval. 
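

A minimal sketch of the reputation-weighted average of Equation 3, combined
with the participation threshold, is shown below. The function name
global_collector_rep, the peer names and the choice to return None when no
eligible peer has voted are assumptions made for illustration; the text does
not say how that case is handled.

```python
def global_collector_rep(local_reps, prev_global, threshold):
    """Reputation-weighted average of peers' local reputations for one target.

    local_reps:  {reporting peer: local reputation it reports for the target}
    prev_global: {reporting peer: its own global reputation, previous interval}
    threshold:   participation threshold; only peers above it may vote
    """
    weights = {
        peer: prev_global[peer]
        for peer in local_reps
        if peer in prev_global and prev_global[peer] > threshold
    }
    total_weight = sum(weights.values())
    if total_weight == 0:
        return None  # no eligible votes in this interval
    return sum(w * local_reps[peer] for peer, w in weights.items()) / total_weight


local_reps = {"a.example": 0.9, "b.example": 0.8, "sybil.example": 0.0}
prev_global = {"a.example": 0.8, "b.example": 0.6, "sybil.example": 0.2}

with_threshold = global_collector_rep(local_reps, prev_global, threshold=0.3)
without_threshold = global_collector_rep(local_reps, prev_global, threshold=0.0)
print(with_threshold, without_threshold)
```

In this example the low-reputation peer's hostile vote is ignored once the
threshold is 0.3, and it is already down-weighted by its own low reputation
when the threshold is 0.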


Experiments and Results 


In this section, we demonstrate the effectiveness
of RepuScore through experiments. We accomplish
this with the help of a) simulated logs to demonstrate
specific properties of RepuScore and b) real logs from
a non-profit organization. The logs from the organization
were 20-day logs collected by five domains that
it maintained. The logs contained information about
45K domains to which about 450K emails were sent,
55% of which were marked as spam by RBLs or rejected
since the sender domain was determined not to
exist through DNS reverse lookup.


Effect of α on Reputation of a Trusted RepuCollector
With Sudden Increase in the Amount of
Spam it Transmits


Spammers might attempt to thwart RepuScore by 
building reputation and then suddenly transmitting 
huge amounts of spam. In such cases, it is expected 
that the reputation of the sender would decrease and 
the spammer would be removed from the trusted 
group within a minimal number of reputation aggrega- 
tion intervals. 


To demonstrate the effectiveness of RepuScore,
we created logs with 100 RepuCollectors spanning 45
reputation aggregation intervals. We selected a random
number of RepuServers which reported to their local
RepuCollectors. The number of emails and spam messages
that were transmitted to and from an organization was
perturbed using a random number; for example, since
RepuScore creates a trusted group of reputable senders,
the spam rate among them was set at under 20%,
whereas a spamming domain’s spam rate was set at
greater than 95%. (We see this trend in the logs from
the non-profit organization.)


Figure 2 demonstrates the reputation of a RepuCollector
from which the amount of spam suddenly increased,
as a function of α. For the first 30 reputation intervals,
the RepuCollector built its reputation and attempted
to be a part of the trusted group. After reputation
interval 30, the spam rate from the RepuCollector
increased to 95%. The RepuCollector’s reputation is
based on the reputation of all its RepuServers. The jump
in the value of reputation is due to the value of α and
the initial reputation value of RepuDomain that was
set at 0.5. Therefore, the reputation of the RepuCollector
for α = 0.9 decreased from 0.7 after the first reputation
aggregation interval. In cases where the sender
does not propagate spam, the reputation should increase
slowly, which indicates a long past history.
Hence the high value of α implies an association with a
long history of good actions. If the sender propagates
spam, the reputation should decrease immediately,
reflecting the current actions of the sender. A low value
of α guarantees an immediate reduction when the
sender propagates spam. Equation 4 demonstrates our
change in the reputation algorithm to accommodate
this behavior. Figure 3 demonstrates the change in
reputation by employing the modified algorithm. For a
high α, the reputation increases gradually but decreases
more rapidly.


Figure 2: Change in the reputation score of a trusted domain that transmits spam after reputation
interval 30, as a function of α. The reputation eventually converges to (1 - average spam rate) over multiple
reputation intervals. A high α puts more weight on the previous reputation score, whereas a low α puts more weight on the
current score. Thus, for high values of α, it takes a long time for the reputation to be built up, whereas for a low α
value the decrease (or increase) in reputation is faster. The sudden drop from the initial score to the first interval
is due to the effect of α. The RepuCollector’s reputation has been set at 0.7. In the future intervals, the
RepuCollector reputation is based on the reputation of all RepuServers, which starts at 0.5. Therefore, for α = 0.9, the
reputation of RepuDomain is around 0.55.
If $\mathit{Rep}(S_i, m-1) \le \mathit{NewRep}(S_i, m)$:
    $\mathit{Rep}(S_i, m) = \alpha \times \mathit{Rep}(S_i, m-1) + (1 - \alpha) \times \mathit{NewRep}(S_i, m)$
Else:
    $\mathit{Rep}(S_i, m) = (1 - \alpha) \times \mathit{Rep}(S_i, m-1) + \alpha \times \mathit{NewRep}(S_i, m)$
Equation 4: Local RepuServer reputation.
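

The modified update can be sketched in Python as follows. The sketch assumes
the reading that matches the behavior described in the text: the previous
reputation keeps weight α while the reputation is not decreasing, and the
weights are swapped when it is decreasing. The function name
update_rep_asymmetric and the simulated spam rates (10% for 30 intervals, then
95%) are illustrative assumptions, not the logs used in the experiments.

```python
def update_rep_asymmetric(prev_rep, new_rep, alpha):
    """Asymmetric moving average: build reputation slowly, lose it quickly."""
    if prev_rep <= new_rep:
        # Reputation is not decreasing: history keeps weight alpha.
        return alpha * prev_rep + (1.0 - alpha) * new_rep
    # Reputation is decreasing: the current interval gets weight alpha.
    return (1.0 - alpha) * prev_rep + alpha * new_rep


# A sender behaves well for 30 intervals, then starts sending mostly spam.
rep = 0.5
history = []
for interval in range(1, 46):
    observed = 0.9 if interval <= 30 else 0.05
    rep = update_rep_asymmetric(rep, observed, alpha=0.9)
    history.append(rep)

print(f"after interval 30: {history[29]:.3f}")
print(f"after interval 31: {history[30]:.3f}")
```

With α = 0.9, the reputation needs many intervals to climb toward 0.9, yet a
single interval of heavy spam removes most of it.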


Figure 4 shows the modified RepuScore algorithm
with collaboration among multiple domains using
the 20-day logs from the non-profit organization.
The reputation of the spamming domain decreased,
but the reputation of a good domain increased.


Participation Threshold and Initial Values for Repu- 
Collector 


Having an appropriate initial value for a RepuCollector’s
reputation is extremely important to maintain
a trusted group of reputable senders. For instance, if
the initial reputation scores for the RepuCollector and
RepuServers are set too high, it would take a long time
for the reputation to decrease. On the other hand, if the
initial reputation is set too low, it would take a long
time for the reputation of a non-spamming RepuCollector
to increase.


Our experiments show that an ideal initial reputation
value for the RepuServer and the RepuCollector
is between 0.5 and 0.7. With different initial values we
noted that the average reputation of all the domains
using the logs from the non-profit organization
converged to about 0.6 for α = 0.1, 0.47 for α = 0.5 and
0.36 for α = 0.9. Hence, an ideal initial reputation
should be equal to the average reputation of all domains
in the system after a long period of time. In order
for new domains to participate in the reputation
aggregation intervals, the threshold should be
0.1-0.3 below the initial reputation.


Resilience to Sybil Attacks 


We increased the percentage of malicious RepuCollectors
from 10% to 30% to demonstrate RepuScore’s
resilience to Sybil attacks. Each malicious RepuCollector
transmits a high amount of spam (> 95%) for the first 30
reputation aggregation intervals. After 30 reputation
intervals, we had the Sybil attacker start increasing
the reputation of its own Sybil domains and decreasing
the reputation of other domains. Figure 5 demonstrates
our results. The reputation of the Sybil domains
steadily decreased, but the reputation of the non-Sybil
domains increased.


Figure 3: In the modified RepuScore algorithm, a high value of α (other than 1.0) implies a gradual increase but a fast
decrease in reputation when the domain starts spamming.


Figure 4: Using the modified RepuScore algorithm with 20-day logs from a non-profit organization. A number of new
RepuCollectors were introduced at different reputation aggregation intervals.


Conclusion 


RepuScore is a collaborative reputation frame- 
work that collects votes from multiple organizations in 
order to collectively compute the reputation of a 
sender. We believe that RepuScore is a step toward 
enforcing sender accountability through collaboration 
among domains. 


Simply blacklisting spammers is ineffective be- 
cause spammers continue to easily create new sender 
identities. In contrast, a legitimate sender’s identity 
typically exists for long periods of time. Thus, we be- 
lieve a reputation framework such as RepuScore will 
be more effective in blocking spam email by maintain- 
ing a group of reputable trusted senders rather than 
identifying spamming domains. 

RepuScore distributes the overhead for reputa- 
tion collection and computation by using a distributed 
architecture while allowing a centralized authority to 
collectively calculate the global reputation for each 
sender domain. 


Our experiments using simulated logs and an ac- 
tual log from a non-profit organization demonstrated 
RepuScore’s effectiveness and its ability to thwart Sybil 
attacks. We also presented the algorithms for reputation
score calculation and demonstrated the effect of the
correlation factor α, where a sender’s reputation increases
gradually when it does not propagate spam but decreases
immediately when it transmits spam.


Availability


RepuScore will be an open-source effort aimed
at providing participating domains with the ability to
contribute information about senders and also look up
the collected reputation about them. RepuScore will
be made available from http://isr.uncc.edu/RepuScore .


Acknowledgement 


We would like to thank our shepherd, Peter 
Galvin, and the anonymous reviewers for their insight- 
ful comments. We would also like to show our appre- 
ciation to our LISA copy-editor, Rob Kolstad, for his 
excellent help. Finally, we thank our ISR lab member,
Sumeet Jain, for his help from the early stage of this
paper.


Author Biographies


Gautam Singaraju is a fifth-year doctoral student at the
University of North Carolina at Charlotte and is advised
by Dr. Kang. Previously, he completed an M.S. in Computer
Science at UNC Charlotte and a B.Tech. in Electronics
and Communication Engineering at JNTU, India.
Since 2003, he has also been a volunteer System
Administrator for a global non-profit organization. During
the summer of 2007, he worked at VMware with the
performance group. Gautam can be reached at
gsingara@uncc.edu.


Brent Hoon Kang received his Ph.D. in Computer
Science from the University of California at Berkeley,
working on the Berkeley Digital Library and OceanStore
project. Prior to Berkeley, he received an M.S. in
Computer Science from the University of Maryland at
College Park, and a B.S. in Computer Science and Statistics
from Seoul National University. Since Fall
2004, he has been an assistant professor at the University
of North Carolina (UNC) at Charlotte, and has been
leading the Infrastructure Systems Research (ISR) Lab.
As part of his research efforts, he recently worked on IT
infrastructure design and administration issues related to
protecting infrastructure against security threats such as
bots/malware and the email spam/phishing problems.
As part of his efforts on the Information Assurance
(IA) education program, he has been developing
hands-on cyber exercise components that foster students’
creativity and problem-solving skills for IT
systems design and defense. Hoon can be reached at
bbkang@uncc.edu.


[Figure 5 plot. Series: Sybil Domain with 10% Sybil domains; Sybil Domain with 20% Sybil domains;
Sybil Domain with 30% Sybil domains. X-axis: Number of Reputation Aggregation Intervals (1 to 45).]

Figure 5: Multiple spamming domains (under a Sybil attacker’s control) increase their votes for each other after
reputation aggregation interval 30. The domains give each other high reputation scores and attempt to decrease
the reputation of other domains. However, our RepuScore framework was resilient to the Sybil attack. The
participation threshold was set to 0.


250 21st Large Installation System Administration Conference (LISA ’07)


Singaraju & Kang


Bibliography 


[1] Allman, E., DomainKeys Identified Mail (DKIM):
Introduction and Overview, 2005,
http://mipassoc.org/dkim/info/DKIM-Intro-Allman.html .

[2] Biswas, S. and R. Morris, “ExOR: Opportunistic
Multi-Hop Routing for Wireless Networks,” Proceedings
of ACM SIGCOMM ’05, Philadelphia,
2005.

[3] Taylor, Bradley, “Sender Reputation in a Large
Webmail Service,” Third Conference on Email
and Anti-Spam (CEAS 2006), 2006.

[4] CipherTrust, TrustedSource: The Next-Generation
Reputation System, White Paper, 2006.

[5] Dewan, P. and Dasgupta, P., “Pride: Peer-to-Peer
Reputation Infrastructure for Decentralized Environments,”
Proceedings of the 13th International
World Wide Web Conference on Alternate Track
Papers & Posters, pp. 480-481, ACM Press, New
York, NY, USA, May 19-21, 2004.

[6] Goodmail Systems, Certified Email, http://www. 
goodmailsystems.com/certifiedmail . 

[7] Habeas, Habeas Safe List, http://www.habeas.com/ 
en-US/Senders/Safelist/ . 

[8] Habeas, Habeas SenderIndex, http://www.habeas. 
com/en-US/Receivers/SenderIndex/ . 

[9] Brondmo, Hans Peter, Margaret Olson, Paul Bois- 
sonneault, Project Lumos: A Solutions Blueprint 
for Solving the Spam Problem by Establishing 
Volume Email Sender Accountability, 2003. 

[10] Jakobsson, Markus, Steven Myers, Phishing and 
Countermeasures, Understanding the Increasing 
Problem of Electronic Identity Theft, Wiley, 2006. 

[11] Microsoft Corporation, Sender ID Framework — 
Executive Overview, 2004. 

[12] Goodrich, Michael T., Roberto Tamassia, Danfeng 
Yao, “Accredited DomainKeys: A Service Archi- 
tecture for Improved Email Validation,” Proceed- 
ings of the Second Conference on Email and Anti- 
Spam (CEAS), 2005. 

[13] Peterson, Patrick, “SIDF and DKIM overview
Scorecard,” Authentication Summit II, 2006,
http://www.aotalliance.org/summit_archive/pdfs/2_Summit_Scorecard_final.pdf.

[14] Proofpoint, High-performance Email Reputation 
and Connection Management, March, 2007, http:// 
www.proofpoint.com/products/dynamic-reputation. 
php . 


RepuScore: Collaborative Reputation Management Framework ... 


[15] Price, W., Inside PGP Key Reconstruction, A 
PGP Corporation White Paper, 2003. 

[16] Ramachandran, Anirudh and Nick Feamster, “Un- 
derstanding the Network-level Behavior of Spam- 
mers,” Proceedings of ACM SIGCOMM, Pisa, 
Italy, 2006. 

[17] Realtime Blackhole List, Mail Abuse Prevention 
System LLC, California, 2002, http://www.mail- 
abuse.org/rbl/. 

[18] Sender Score Certified, Return Path Manage- 
ment, http://www.senderscorecertified.com . 

[19] Return Path, Sender Score Email Reputation Man- 
agement, http://www.returnpath.com/delivery/ 
senderscore . 

[20] Srivatsa, M., L. Xiong, L. Liu, “TrustGuard: Countering
Vulnerabilities in Reputation Management
for Decentralized Networks,” 14th World Wide Web
Conference (WWW 2005), Japan, 2005.

[21] Jordan, Stephanie, Matt Blumberg, Des Cahill, 
Richard Gingras, “Accountable Email: Building 
on Authentication,” Authentication Summit II, 
2006, http://www.aotalliance.org/summit_archive/ 
pdfs/7_building_on_authentication.pdf.

[22] Wong, M. W., Sender Authentication: What To 
Do, Technical Document, 2004, http://www.open- 
spf.org/whitepaper.pdf . 

[23] Yahoo Inc., DomainKeys: Proving and Protecting 
Email Sender Identity, http://antispam.yahoo.com/ 
domainkeys . 

[24] Yu, Bin and Munindar P. Singh, “An Evidential 
Model of Distributed Reputation Management,” 
Proceedings of the 1st International Joint
Conference on Autonomous Agents and MultiAgent
Systems (AAMAS), 2002.

[25] Yu, Bin and Munindar P. Singh, “Detecting De- 
ception in Reputation Management,” Proceed- 
ings of the 2nd International Joint Conference on 
Autonomous Agents and MultiAgent Systems (AA- 
MAS), Melbourne, ACM Press, 2003. 

[26] Yu, H., M. Kaminsky, P. B. Gibbons, and A. D. 
Flaxman, “Defending Against Sybil Attacks via 
Social Networks,” Proceedings of ACM SIG- 
COMM Conference, 2006. 

[27] Zimmermann, P., The Official PGP User's Guide, 
MIT Press, Cambridge, 1995. 


21st Large Installation System Administration Conference (LISA ’07) 251 
ABSTRACT


We describe a tool that provides a method for running dataless caching clients: a hybrid
combination of imaging systems with traditional diskless nodes. Unlike imaging systems, it needs
only a single boot to get to a running system; unlike diskless systems, it is more robust and
scalable, as it does not continuously depend on central servers.


The tool, Moobi, uses the peer-to-peer protocol BitTorrent to provide efficient distribution of 
the image cache, and combines standard diskless tools to provide the basis for the running system. 
Moobi makes it possible to run large installations of “thin server” farms.


Introduction 


Diskless clients are appealing for large clusters 
of similar systems performing similar functions since 
they limit the capability for local configurations and, 
more specifically, local misconfigurations on each
system. This provides a more uniform environment and
allows for the management of a large number of
systems with fewer resources. The key for these systems
is not so much that they are diskless as that they are
dataless: there is nothing which is inherently tied to
the nodes. In the desktop realm, these are typically
referred to as thin clients. In the server realm, “throw
away” and “field replaceable units (FRUs)” are some
common terms.


The difference between diskless and dataless 
hinges upon drive cost, application robustness over 
drive failure rates, and the need for local working disk. 
The size of drives has increased significantly, especially
when compared to the size of operating systems and
application bases. Pinheiro et al. [1] demonstrated that
annual hard drive failure rates were between 2% and 
10% depending on age and usage; even with 2%, in a 
large population of machines, there is an inherent need 
for applications to account for downed nodes and 
thereby make drive failure moot. In addition, a local 
drive provides a convenient location for cached data or 
temporary work space, and so they are included in most 
clusters regardless of management methods. These rea- 
sons push commodity servers towards having local stor- 
age and treating it as a large cache or work disk, but not 
as a true data disk. 


Since these systems have drives, they also have a 
unique local instance of the operating system. Whether 
this operating system is installed via an imaging system, 
or installed via an automated installation mechanism is 
largely irrelevant since the result is an individual system 
instance with a unique instance of every part of that
system. This individual identity is necessary for certain
items (hostname, ssh keys) but is undesirable for other
items (/usr software). Any configuration management
system for these setups must account for and check
every piece individually to validate the consistency of
the environment.


Truly diskless clients do not have this problem. 
Nothing is locally maintained, and a single instance, 
typically of an NFS server, is directly used for the en- 
vironment. This provides the consistency, but loses in 
performance: diskless clients are very hard to scale to
hundreds of nodes. This limitation on scaling is due to
their reliance on the large file servers that support the 
diskless nodes. An installed environment suffers a 
similar scaling issue whenever massive deployments 
have to happen in short order — initial spin ups, disas- 
ter recovery, or massive updates. Diskless client envi- 
ronments have a continuous scaling issue, while the 
installed environments have “impulse” scaling issues. 


Moobi combines the consistency of diskless sys- 
tems with the efficiency of imaged or installed sys- 
tems. It achieves this by dividing the operating system 
image into a small configurable portion — the root or 
/etc — and a large fixed image — /usr — and by caching 
and sharing the large image via BitTorrent. The fixed 
image is maintained as a whole instance, so only one 
item needs to be validated against the source. Every
time a node boots, or as often as is necessary for
auditing purposes, it can scan the cache, download any
new image, and update the cache.


BitTorrent is used for validating and updating the
image cache. Due to its nature, these two operations are
inherently connected: in order to download and update,
it must validate, and it treats updating as the only
logical followup to validating. BitTorrent also provides
an update mechanism which does not require large
monolithic file servers; thus it alleviates the bulk of the
scaling issues for large installations. In addition,
BitTorrent allows multiple systems to distribute the image
without much additional logic. This makes it easier to
make the system more robust overall.
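
As a rough illustration of why validating and updating are connected, the following Python sketch checks a cached image against a list of per-piece SHA-1 digests, which is how BitTorrent metadata describes file contents, and reports which pieces would have to be fetched. The function names, the 1 KB piece size used in the demonstration, and the in-memory byte strings are assumptions made for this example; they are not taken from Moobi itself.

```python
import hashlib


def piece_hashes(data, piece_size):
    """Split data into fixed-size pieces and return the SHA-1 digest of each."""
    return [
        hashlib.sha1(data[start:start + piece_size]).digest()
        for start in range(0, len(data), piece_size)
    ]


def stale_pieces(cached, expected, piece_size):
    """Return the indices of pieces that are missing or differ from expected."""
    have = piece_hashes(cached, piece_size)
    return [
        index for index, digest in enumerate(expected)
        if index >= len(have) or have[index] != digest
    ]


# Demonstration with a tiny stand-in for an image file (4 pieces of 1 KB).
source = bytes(range(256)) * 16
expected = piece_hashes(source, 1024)

cached = bytearray(source)
cached[2048] ^= 0xFF  # corrupt the first byte of piece 2

print(stale_pieces(bytes(source), expected, 1024))  # intact cache
print(stale_pieces(bytes(cached), expected, 1024))  # one damaged piece
```

An intact cache yields an empty list, so validation alone completes the operation; any non-empty result is exactly the download work list.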


The mechanisms used in Moobi provide additional
availability for applications running on Moobi
nodes. Not only does Moobi reduce the MTTR, but it
also provides several hooks for system hardening.
Periodic image checking provides a level of confidence in
the integrity of the software being used. In addition, the
static file system images, such as the software image
/usr, can be mounted with read-only permissions.
Application stacks could be wrapped up in a similar manner
to provide extra confidence in the stack.


Going to a hybrid solution also acts as a stepping 
stone. Most organizations grow organically from indi- 
vidual system instances to dataless systems. Hard 
drives tend to remain an item in new systems, and are 
not removed from repurposed systems. 


Related Work 


Given Moobi’s hybrid nature, similar work falls 
into either the imaging and installation category or in- 
to the diskless client category, yet neither category 
completely describes it. Unlike imaging systems, it is 
a single boot to get to a running system; unlike disk- 
less systems, it is less fragile as it does not continuous- 
ly depend on central servers. 


Several commercial and noncommercial imaging 
systems exist to date. Norton Ghost [2] is the industry 
standard for commercial system imaging software. 
Several noncommercial systems provide similar fea- 
tures to Ghost, and have been able to address many of 
the distribution issues. Frisbee [3], in particular, has 
been shown to have very good image distribution per- 
formance [4]. Partition Image [5] provides a wide 
range of image support, but is not very efficient in its 
distribution. 


Typical system imaging packages use one of two 
network mechanisms for the transporting of the golden 
image to the target machine: a unicast transfer, or a
multicast or broadcast blast. Unicast transfers can be very
expensive on the boot servers. Broadcast is very
stressful on the network. Multicast requires advanced
synchronization techniques and places additional
constraints on how and when nodes can come online. In
contrast, a BitTorrent swarm limits server load, limits 
the scope of network usage to the ports involved, and 
automatically handles synchronization — if a BitTorrent 
client is late to the swarm, its peers will catch it up. 


All imaging systems require some mechanism to 
address host specific file overlays. Typically, this is done 
by just copying down the necessary files from a central 
system. Ghost in particular is very good at performing 
post imaging instance fix-ups to reduce unique identifier 
conflict without needing a central software repository. 
However, it does not address non-Windows operating
systems or other post installation fix-ups, and expects
the environment to fix them either manually or via a larger
control system (Active Directory, for instance).


By far the most prominent implementation of diskless
linux is the Linux Terminal Server Project [6]. However,
diskless nodes have a long history under other
operating systems, and also have a long history of
performance issues with a central server [7]. Traugott and
Huddleston [8] highlight Sun Autoclient’s CacheFS
features to reduce this. Moobi reduces these issues to
occurring only at boot time, and allows for redundant
boot servers with little or no synchronization between
them.


BitTorrent is heavily in use in distributing large 
software images to large audiences. Many linux distri- 
butions, live CDs, large software applications, and 
VMWare appliance images [9] are distributed using 
BitTorrent. There are discussions of integrating Bit- 
Torrent with linux distributions’ installation or update 
systems, but this discussion has produced little. 


Recently, SystemImager [10] of the System Instal- 
lation Suite has incorporated BitTorrent as a distribution 
method for its imaging system. Moobi and SystemIm- 
ager run into many similar obstacles, especially when 
handling image versus host specific overlays, and when 
attempting to achieve high performance and scalability. 
However, SystemImager still views the many systems 
as single instances and can fall into divergent configura- 
tions. In addition, SystemImager has a two boot imaging 
operation — one boot for the imaging, and another boot 
to start the imaged system for normal operation. 


Moobi Design 
Moobi Basics 


Several aspects are assumed for the Moobi image 
building and deployment discussion: how Moobi lever- 
ages the linux boot process, how Moobi distributes its 
own software, how additional configuration information 
is passed to identical nodes, and host specific overlays. 


First, Moobi intercepts the normal linux boot 
process to prepare a specialized environment prior to 
handing off to the standard init process. 


Almesberger [11] describes the linux boot process
in great detail. The relevant portions for this discussion
are summarized here. With loadable modules being as
prominent as they are, almost all recent linux boots
require the use of the initrd. Typically, the initrd is a
pseudo-root file system which contains the kernel module
utilities as well as any modules necessary for the system
to be able to mount the true root file system. If this is a
local disk, disk drivers and file system drivers are loaded. If
this is a network file system, the NFS modules are
loaded. Once the kernel is done with its primary
initialization, it mounts the initrd as a ramdisk and executes the
linuxrc script contained therein. Typically, the linuxrc
script is simplified to the point of just loading the
necessary modules and then handing off execution to init. Each
initrd is built during any kernel installation or upgrade;
the build process tailors it for the system at hand.


Moobi, like LTSP and many other alternative
boot systems, uses the initrd and linuxrc as its
springboard. An initrd for a Moobi booted system ends up
being the full root file system. The linuxrc detects the
hardware coarsely, and initializes the network
interfaces. It then connects to the boot server and retrieves
the necessary torrent file for its image, starts up
BitTorrent, and joins the swarm in progress. After the
image is complete, it downloads the host specific
overlay, then cleans up after itself and hands off normal
boot execution to init.


For this to work, the kernel image and the initrd 
must be available to the booting node. The BIOS’s
PXEBOOT is used to retrieve pxelinux [12] which in 
turn retrieves the kernel image and initrd. In this case, 
pxelinux is the boot loader Almesberger references, 
and is responsible for placing the initrd into the 
ramdisk and then handing off to the kernel. 


Moobi maintains a self contained instance of its 
software on each node during boot. This self contained 
instance is kept under /tmp/bootbin and /tmp/bootlib. 
By making it self contained, Moobi is removed from 
most dependencies on distribution resources. Howev- 
er, there is still a minimal expectation on the remain- 
der of the root file system, such as standard shell and 
bin utils. Moobi cleans up this instance after the pre- 
init setup to avoid taking up precious resources during 
boot. 


The ability to provide additional configuration or 
status information to the linuxrc during boot is very
limited. Basic identity (hostname and networking
information) is provided by DHCP and DNS. Most
other configuration aspects can be derived from those, 
or derived from hardware detection. However, some 
information must be passed in before those. For in- 
stance, unless they are autonegotiated, link speed and 
duplex must be known prior to network initialization; 
otherwise, none of it works. 


Kernel variables (command line parameters) are used to
pass in the additional configuration information needed for the
bootstrapping step. MOOBI_NET is the parameter for
the network link speed and duplex settings in the example
above. MOOBI_NET can be set to AUTO, or 1000F,
100F, or other appropriate values. For the author, this
was necessary due to some hardware and software
compatibility issues related to network link
autonegotiation. Additional parameters could be used if special
parameters needed to be passed for the storage
subsystem startup, but a general hardware configuration
mapping would be more useful. See Future Work
for more discussion on this.
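
A minimal sketch of how such a parameter might be read follows. On a running linux system the kernel command line is exposed as the text of /proc/cmdline; the example below parses a literal string instead so that it runs anywhere. The helper names are hypothetical, and the trailing "H" for half duplex is an assumption, since only AUTO, 1000F, and 100F are named above.

```python
def parse_cmdline(cmdline):
    """Turn a kernel command line into a dict; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def link_settings(params):
    """Return (speed_mbps, full_duplex), or None when autonegotiating.

    Values look like "100F" or "1000F"; "H" for half duplex is assumed.
    """
    raw = str(params.get("MOOBI_NET", "AUTO")).upper()
    if raw == "AUTO":
        return None
    speed, duplex = raw[:-1], raw[-1]
    if not speed.isdigit() or duplex not in ("F", "H"):
        raise ValueError("unrecognized MOOBI_NET value: " + raw)
    return int(speed), duplex == "F"


example = parse_cmdline("ro root=/dev/ram0 MOOBI_NET=100F")
print(link_settings(example))
```

Treating a missing MOOBI_NET as AUTO keeps the common case free of extra boot loader configuration.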


As is the case in most environments, the same 
image may be used in several slightly different con- 
texts. For instance, the same software image may have 
different services enabled for it; or the same image 
may be used in multiple locations and needs to be
customized for those locations, with different network layer
permissions such as firewall rules or tcpwrapper
configs. While some of this can be handled via standard
and optional extensions to DHCP, DHCP is not
necessarily the best place to transfer file data. Almost every
imaging system uses host overlays, and Moobi is no
different. For minor configuration differences between 
nodes using the same image, Moobi maintains a sim- 
plified file transfer over HTTP to retrieve the appro- 
priate overlays. Other mechanisms could easily be 
swapped in for this. 


Before a node is available for Moobi, it must be 
prepared with a disk partition layout which works for 
the images in use. The partitioning scheme only need 
be done once per node for any family of images that 
use that partitioning scheme. The only requirement for 
the partitioning scheme is that an image cache
partition must exist. More advanced logic in the linuxrc
could be used to overcome this limitation of the cur- 
rent implementation. 


Moobi Image Build 


A Moobi “image” is actually comprised of
several file system images, file system skeletons, and the
kernel file. One of the true file system images is the 
root file system. Typically, the other true file system 
image is /usr. Given most installations, this is the 
largest and single most monolithic image that does not 
change. A file system skeleton is a simple text file 
which summarizes file and directory ownership and
permission properties for a file system. This is useful 
for the variable file systems such as /var and /tmp. 
Since these will either be local disk or ramdisk, a pure 
imaging technique provides no advantage over a sim- 
plified technique which just ensures the existence of 
the structure. 


Images can be transferred in one of three ways:
TFTP, primarily just for the root file system image
and the kernel file; HTTP, for the skeleton images;
and, preferably, BitTorrent, for the large file system
images. The root partition is transferred via TFTP; since
TFTP is the least efficient and robust, any transfers
using it should be minimized, and therefore the root
partition should remain as small and streamlined as
possible. Luckily, linux kernels at this point recognize gzip
compressed initrds, so the root partition can be
compressed. The skeleton files are transferred via HTTP
since it provides a reasonable ability to recover from
transient system or network performance failures, and
because multifile torrent support is limited in
some BitTorrent client implementations. Given that the
average size of a skeleton is on the order of 10K, a swarm
is not necessary. BitTorrent is used for large file system
images since it provides both a robust and efficient
transfer mechanism and a native file hash checker.


An “imagebuilder” script has been developed
to aid in building images from scratch. Imagebuilder
takes many of the standard build system inputs:
a description of the file system containers or image
files, pre and post installation scripts, a package
list, and a miniroot of static file assignments. In
addition, Imagebuilder takes a series of configuration
options for the script run itself: source, destination and
working directories, the kernel package to be used,
and the image name. The imagebuilder.conf uses the
ini configuration file format. A typical file system
configuration looks like:
[partition:/]
name=root.img
size=128M
compress=1
skipmd5=1
[partition:/usr]
name=usr.img
size=1024M
compress=0
skipmd5=1
[partition:/var]
name=var.skel
skeleton=1
compress=0
skipmd5=1
Of note is the skeleton flag, which signifies that the file
system in question is a skeleton file system.
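
Because the file is in ini format, it can be read with a stock parser. The sketch below uses Python's standard configparser module on a shortened copy of the configuration above; the load_partitions helper and the dictionary layout it returns are illustrative assumptions, not part of imagebuilder.

```python
import configparser

SAMPLE = """\
[partition:/]
name=root.img
size=128M
compress=1
skipmd5=1
[partition:/usr]
name=usr.img
size=1024M
compress=0
skipmd5=1
[partition:/var]
name=var.skel
skeleton=1
compress=0
skipmd5=1
"""


def load_partitions(text):
    """Map each mount point to its image settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    partitions = {}
    for section in parser.sections():
        kind, _, mount_point = section.partition(":")
        if kind != "partition":
            continue  # ignore unrelated sections
        options = parser[section]
        partitions[mount_point] = {
            "name": options["name"],
            "size": options.get("size"),  # skeletons carry no size
            "compress": options.getboolean("compress", fallback=False),
            "skeleton": options.getboolean("skeleton", fallback=False),
        }
    return partitions


partitions = load_partitions(SAMPLE)
for mount_point, settings in partitions.items():
    kind = "skeleton" if settings["skeleton"] else "image"
    print(mount_point, settings["name"], kind)
```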


The imagebuilder process consists of: 

1. Creating a temporary working directory to act
as an installation root.

2. Running the pre-script to initialize the working
directory.

3. Locating, via a search path, and installing all of
the packages identified by the package list.

4. Copying over any additional stand alone files
from a mini-root. The linuxrc is typically kept
and installed from here.

5. Running the post-script on the working directo- 
ry for any last fix-ups. 

6. Building the image partition files, starting with 
the deepest path first and working up to the 
root. 


For a true file system image, it creates the image 
file by: 

6a. Generating an appropriately sized zeroed image 
file. 

6b. Creating an ext2 or ext3 file system on the im- 
age file. 

6c. Mounting the image file on a loopback device. 

6d. Rsyncing the appropriate subdirectory onto the 
image file’s mount point. Subdirectories for 
other images are explicitly excluded. 

6e. Unmounting the image file and performing any 
additional operations on the image (generating 
a checksum hash, compressing it, etc.). 
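
The ordering in step 6 and the sub-steps 6a through 6e can be sketched as follows. The code only assembles command lines and does not run them, since mounting requires root privileges. The choice of dd and mkfs.ext2 and all paths are assumptions for illustration; the text above names only the file system type, the loopback mount, and rsync.

```python
def build_order(mount_points):
    """Order mount points deepest path first, so the root comes last."""
    def depth(path):
        return len([part for part in path.split("/") if part])
    return sorted(mount_points, key=depth, reverse=True)


def image_commands(image, size_mb, source_dir, mount_dir, excludes=()):
    """Return argv lists for steps 6a to 6e; nothing is executed here."""
    rsync = ["rsync", "-a"]
    rsync += ["--exclude=" + pattern for pattern in excludes]
    rsync += [source_dir.rstrip("/") + "/", mount_dir.rstrip("/") + "/"]
    return [
        # 6a: zero-filled file of the requested size
        ["dd", "if=/dev/zero", "of=" + image, "bs=1M", "count=" + str(size_mb)],
        # 6b: file system on the plain file (-F skips the block device check)
        ["mkfs.ext2", "-F", image],
        # 6c: loopback mount
        ["mount", "-o", "loop", image, mount_dir],
        # 6d: copy the tree, leaving out the contents of other images
        rsync,
        # 6e: unmount; checksumming or compression would follow
        ["umount", mount_dir],
    ]


order = build_order(["/", "/usr", "/usr/local", "/var"])
commands = image_commands(
    "root.img", 128, "/tmp/work", "/mnt/img", excludes=("/usr/*", "/var/*")
)
print(order)
for argv in commands:
    print(" ".join(argv))
```

Excluding "/usr/*" rather than "/usr" keeps the empty directory in the root image, so it can serve as the mount point for the separate /usr image.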


For skeleton images, it creates a text file by re- 
cursing through the appropriate subdirectory and col- 
lecting file name, ownership, and permission informa- 
tion. 
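
A skeleton generator along these lines can be written in a few lines of Python. The one-line-per-entry format used here (type, octal mode, uid, gid, path) is a hypothetical layout chosen for the example; the actual skeleton file format is not specified above.

```python
import os
import stat
import tempfile


def build_skeleton(root):
    """Return one 'type mode uid gid path' line per entry under root."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(dirnames + filenames):
            full = os.path.join(dirpath, name)
            info = os.lstat(full)  # do not follow symbolic links
            kind = "d" if stat.S_ISDIR(info.st_mode) else "f"
            relative = "/" + os.path.relpath(full, root)
            lines.append("%s %04o %d %d %s" % (
                kind, stat.S_IMODE(info.st_mode),
                info.st_uid, info.st_gid, relative))
    return lines


def demo():
    """Build a tiny /var-like tree and return its skeleton."""
    with tempfile.TemporaryDirectory() as root:
        log_dir = os.path.join(root, "log")
        os.mkdir(log_dir)
        os.chmod(log_dir, 0o755)
        messages = os.path.join(log_dir, "messages")
        open(messages, "w").close()
        os.chmod(messages, 0o640)
        return build_skeleton(root)


skeleton = demo()
print("\n".join(skeleton))
```

Since os.walk visits a directory before its contents, parents are listed ahead of their children, which is the order needed to recreate the structure on a node.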


This process does not need to be followed for a 
“valid” image to be built. Any generation system 
which creates valid image files containing valid file 
systems could be used. For instance, a linux live CD 
image could be used as a followup. However, for this 
to work, the linuxrc must be able to retrieve, place,
and open the image files.


Moobi Image Deployment 


Once the image is built, it is moved to the boot
server, broken up and located as appropriate for the
image transfer mechanism. The kernel image and
initrd/root file system are placed in the TFTP directory.
The skeleton files are placed in the HTTP service’s
document directory. Large file system images are also
located in the HTTP service’s document directory,
since a specific location is not necessary as the
BitTorrent client will work anywhere. It also allows for
singular fix-up transfers to happen over HTTP should the
need arise.


The image shephard, or ishephard, process is re- 
sponsible for maintaining all of the subsystems which 
are necessary for a complete Moobi boot: any network 
configurations, the dhcp daemon, the tftp daemon, the 
http daemon, and any BitTorrent seeders. The boot 
server is the primary for all of these, but additional 
boot servers could provide equivalent services. All
function independently of each other, and typically, the
first to respond would be the one to be used.


A typical Moobi node boot proceeds as shown in 
Figure 1. 


The current boot server handles multiple VLANs, 
primarily to directly serve DHCP with network helpers. 
Since it already has an interface on each VLAN, it has 
a specific seed for that VLAN. This restricts network 
traffic two-fold. No BitTorrent traffic traverses routed 
interfaces, and this reduces the need for firewall holes. 
In most but not all cases, VLANs do represent separate 
network localities. Restricting network traffic to the 
VLAN should keep network stress localized. However, 
it would not require much additional configuration if an
environment wants to allow traffic to traverse routed
interfaces (for example, because it does not want to
provide a trunk to the boot server, or does not have a
given VLAN distribution). Each image
distribution is given a specific port, which is
configurable and can be kept to a small range, so the firewall
or network ACL changes could be kept minimal.


Experimental Data 


Initial deployments have been able to boot over a 
hundred systems with 1.4 GB /usr images with an av- 
erage deployment time of around 5-6 minutes on a 100 
Mb/s ethernet network spanning multiple switches 
linked by multigig trunks. The boot failure rate
under this load has been unacceptably high, at about
5-10% in the worst case, with failures occurring primarily in
the initial tftp transfer.


A more rigorous set of tests was run for this
paper. The environment consisted of 64 booting nodes,
a DHCP/PXE boot server, and 1-4 BitTorrent seeds or
one HTTP server. All systems were running on
commodity hardware with a single 2.4 GHz P4, 4 GB of
RAM, and a 1 Gb network connection. All devices
were evenly split across two Cisco 4948 switches with
a 4 Gb trunk connecting them; logically, all devices
lived on the same VLAN.


The runs were broken up into different load sizes 
(1, 4, 16, or 64 nodes), and into different load
mechanisms or configurations. The configurations were labeled:


SEED      BitTorrent distribution using a single
          seed, and the booting node continued
          sharing after it finished booting.

NOSEED    BitTorrent distribution using a single
          seed, but the booting node stopped
          sharing after it finished booting.

4SEED     BitTorrent distribution using four seeds,
          and the booting node continued sharing
          after it finished booting.

4NOSEED   BitTorrent distribution using four seeds,
          but the booting node stopped sharing
          after it finished booting.

HTTP      Download of the fixed image using wget
          over http.


Each series of runs (load size and configuration)
was performed five times. For the single node boot,
only three configurations were used: SEED, 4SEED,
and HTTP. With a single node, the NOSEED and 4NOSEED
configurations are identical to the SEED and 4SEED
configurations, respectively.

A run consisted of issuing a shutdown/restart to
all of the nodes at the same time, and measuring all
times relative to the shutdown issuance. Time
measurements were taken at the time of boot, the time of the
fixed image transfer start, the time of the fixed image
transfer end, and the time of the end of boot (last
command in rc.local). From these, the times for PXE
transfer and fixed image transfer were calculated and
reported; see Table 1.


Node                     Total
Cnt.   Method            Time (s)
1      SEED/NOSEED        397.8
1      4SEED/4NOSEED      390.8
1      HTTP               245.0
4      SEED               452.8
4      NOSEED             450.8
4      4SEED              418.6
4      4NOSEED            419.8
4      HTTP               277.0
16     SEED               542.8
16     NOSEED             490.2
16     4SEED              443.4
16     4NOSEED            456.8
16     HTTP               449.5
64     SEED               602.2
64     NOSEED             623.8
64     4SEED              569.0
64     4NOSEED            626.2
64     HTTP              1208.0

(The PXE, D/L, and Failures columns of this table are not
available in this copy.)





Figure 2: Boot times. 


CPU and network usage for the serving hosts 
were watched during the course of these runs. Of note, 
only two resources appeared to be stressed: 

1. the HTTP server’s network interface during the 
fixed image transfer — expected since it is the 
only source for upwards of 128GB of data to 
transfer. 

2. the DHCP/PXE boot server’s load during tftp 
transfers — it spawns separate processes for 
each tftp connection. 

Beyond those, CPU utilization never passed 35% and 
network utilization never passed 10%. 
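
The saturation of the HTTP server's interface follows from simple arithmetic, sketched below. The figures of 128 GB, 64 nodes, and a 1 Gb connection come from the text above; reading "GB" as 10^9 bytes and ignoring protocol overhead are simplifying assumptions.

```python
GB = 10 ** 9              # decimal gigabyte (assumption)
LINK_BITS_PER_S = 10 ** 9  # 1 Gb/s server interface

nodes = 64
total_bytes = 128 * GB


def single_source_seconds(num_bytes, link_bits_per_second):
    """Lower bound on the time to push num_bytes through one link."""
    return num_bytes * 8 / link_bits_per_second


per_node_gb = total_bytes / nodes / GB
floor_seconds = single_source_seconds(total_bytes, LINK_BITS_PER_S)

print("image size per node: %.1f GB" % per_node_gb)
print("minimum transfer time from one server: %.0f s" % floor_seconds)
```

Under these assumptions a single 1 Gb/s source cannot deliver the images to 64 nodes in much less than about 1024 seconds, which is consistent with the 1208.0 second total reported for the 64 node HTTP runs, a figure that also includes the PXE and boot phases.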


A kickstart was also performed as an additional
comparison. The approximate time for the kickstart
through to an available system was on the order of 1200
seconds. Given the distribution method for the kickstart
(single source HTTP), the time would be expected to
increase with node count as it did for the HTTP method above.

Observation 1: As expected, a peer-to-peer distri- 
bution system works exceedingly well in scaling to 
many nodes. 

Observation 2: The transfer time for the PXE
portion grows with the number of nodes, as expected
for a single source. This leads to boot
failures which require manual intervention.

Observation 3: There is only a minor improve- 
ment when providing additional seed nodes. 

Observation 4: The time for a seed image distribu- 
tion compared to an installed mechanism was approxi- 
mately 1:4. 


Figure 3: Boot Times with a Rate Limited BitTorrent.


Observation 5: The largest influencer of distribu- 
tion time is the initial seed’s transfer rate. Even for a 
large environment — 64 nodes — the download time 
was very close to the case where one node would 
download at the transfer rate. 


Table 2: Boot Times with a Rate Limited BitTorrent
(columns: Transfer Rate Limit, Total, PXE, D/L, and
Theoretical Time; the values are not available in this copy).


Summary 


In general, Moobi performs very well for large
installations. It appears to have a break point on the order
of tens of nodes at which it is equivalent to installed or
imaged systems from a single node. In addition, it does
not show significant performance degradation as the
number of clients increases. Given its efficient scaling, it
allows for large clusters of systems which are image
updated on a very regular basis, and this allows for a shift
in the way those systems are managed.


Critique


Moobi is reliant upon the node’s BIOS PXE
implementation, either system or NIC, for proper behavior.
Several PXE implementations which the author has
worked with are not very robust. They either are unable
to recover when an overloaded boot server fails to hand out a
response within a fixed, unconfigurable timeframe, or they
are unable to recover from a slow tftp transfer: once
the tftp has slowed down, the client will not receive
data any faster than the lowest previous data send even if
the server is able to recover and begin sending more.
Additional boot servers, a more efficient tftp server, or
better BIOS support could overcome this.


Several pieces of the current Moobi implementa- 
tion are specific to RedHat-like distributions. This is 
just a limitation of this implementation and not a limi- 
tation of the technique. RPMs could be replaced with 
DEBs. The Python BitTorrent client and related
hostfiles tools could be replaced with tools written in
compiled or other scripting languages better suited for
the desired distribution.


Currently, the linuxrc used within Moobi is very
specific to the structure of the image in use. This is both a
feature and a failing of Moobi, since it requires in depth
knowledge of the process to be able to boot systems.
This could be made more robust by accepting
different image structures as an output of the image
building process.


Future Work 


Advanced hardware detection. Currently, Moobi
only varies across three possible hardware configurations,
and so the detection mechanism is very minimal,
mainly observing the hard drive (IDE versus SCSI)
and the ethernet interfaces. This fails to scale as is, but
two approaches can be attempted for this. The first is a
relatively small hardware lookup table. Since many
large organizations use a limited set of vendors and
system configurations, a small table will most likely
be sufficient. This could be passed in as a kernel
command line parameter (MOOBI_HARDWARE_ID or
so), or derived from a similar detection of base hardware assets.


Image overlays. The image overlay system is 
very immature. It currently just lists files to be
transferred with the correct properties. This would be an
ideal place to drop in high level configuration man- 
agement systems, such as Cfengine or Puppet, for a 
more manageable approach. 


Partial image updates. The current usage of 
Moobi treats each new image as a completely different 
item. This means that all updates involve transferring 
the entire new image, and Moobi has certainly been 
optimized for this. However, an additional incremental 
improvement can be achieved when acknowledging 
that most of the time the primary image is not very 
different. Typically, only a small fraction of the image
description changes (a new software package or such),
and that corresponds to only a small fraction of the
fixed image file changing. Given BitTorrent’s blocking
scheme, it would pick up that only the blocks with
changes need to be sent, and could easily reduce the
image transfer.
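
How much a partial update saves depends on how the image file changes. The sketch below, a toy model with an assumed 1 KB piece size and hypothetical helper names, compares per-piece SHA-1 digests of an old and a new image. An overwrite in place touches a single piece, while an insertion that shifts the remaining bytes changes every later piece as well.

```python
import hashlib


def digests(data, piece_size):
    """Per-piece SHA-1 digests of data."""
    return [
        hashlib.sha1(data[start:start + piece_size]).digest()
        for start in range(0, len(data), piece_size)
    ]


def pieces_to_send(old, new, piece_size):
    """Indices of pieces in new that differ from, or are absent in, old."""
    old_digests = digests(old, piece_size)
    return [
        index for index, digest in enumerate(digests(new, piece_size))
        if index >= len(old_digests) or old_digests[index] != digest
    ]


old_image = bytes(range(256)) * 16  # 4 pieces of 1 KB

# Case 1: one byte overwritten in place (inside piece 1).
in_place = bytearray(old_image)
in_place[1500] ^= 0xFF

# Case 2: one byte inserted at the front, shifting everything after it.
shifted = b"\x00" + old_image

in_place_result = pieces_to_send(old_image, bytes(in_place), 1024)
shifted_result = pieces_to_send(old_image, shifted, 1024)
print(in_place_result)
print(shifted_result)
```

The benefit therefore relies on the image build producing files whose unchanged content stays at the same offsets, which is a property of the build process rather than of BitTorrent.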


BitTorrent locality information. BitTorrent is 
very efficient, yet it does not know anything about its 
local world. All clients are equivalent to it. Since 
many large cluster nodes are grouped by switch, it 
would be convenient to leverage this so as to not satu- 
rate any switch to switch links. 
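
One possible shape for such a locality hint is sketched below: given a peer list annotated with a switch identifier, prefer peers on the local switch and admit only a bounded number of remote ones. The peer tuples, the switch labels, and the max_remote parameter are all hypothetical; how a node would learn its switch is left open.

```python
def select_peers(local_switch, peers, max_remote=1):
    """Keep every same-switch peer and at most max_remote others.

    peers is a list of (address, switch_id) tuples.
    """
    local = [peer for peer in peers if peer[1] == local_switch]
    remote = [peer for peer in peers if peer[1] != local_switch]
    return local + remote[:max_remote]


peers = [
    ("10.0.1.5", "sw-b"),
    ("10.0.0.7", "sw-a"),
    ("10.0.1.9", "sw-b"),
    ("10.0.0.3", "sw-a"),
]
chosen = select_peers("sw-a", peers)
print(chosen)
```

Keeping at least one remote peer lets new pieces still cross the trunk, while the bulk of the exchange stays within the switch.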
ABSTRACT


The high rate of requirement changes makes system administration a complex task. This
complexity is further influenced by the increasing scale, unpredictable behaviour of software and
diversity in terms of hardware and software. In order to deal with this complexity, configuration
management solutions have been proposed. The processes that many configuration management solutions
advocate are kept close to manual system administration. This approach has failed to address the 
complexity of system administration in the real world. In this paper, we propose PoDIM: a high- 
level language for configuration management. In contrast to many existing configuration manage- 
ment solutions, PoDIM allows modeling of cross machine constraints. We provide an overview of 
the PoDIM notation, describe a case study and present a prototype. We believe that high-level lan- 
guages are needed to reduce system administration complexity. PoDIM is one step in that direction. 


Introduction 


Configuration errors are the biggest contributors
to service failures (between 40% and 51%) and also
take the longest time to repair [37, 36, 34]. As the
complexity of computer infrastructures increases, the
risk of configuration errors increases likewise and
introduces even higher change costs. Changes to a
configuration can be technical (such as software
upgrades) or business oriented. A difficulty with
configuration changes is the
high number of dependencies between systems. Sys- 
tems do not operate in isolation, but in a network. A 
change in the configuration of one networked service 
may cause a complex chain of changes in dependent 
services. Furthermore, infrastructural complexity is in- 
fluenced by increasing scale, unpredictability in soft- 
ware behaviour and systems variety [21, 39, 4]. 

1. scale: The number of network devices, servers, 
desktops and laptops in a typical infrastructure 
is increasing significantly. New kinds of devices
such as PDAs, mobile phones and sensor
nodes are extending the scope of an organiza- 
tional computer infrastructure. 

2. unpredictability in software behaviour: In- 
creasingly complex software systems tend to 
have more bugs, viruses and vulnerabilities. 
Bugs in software, viruses and vulnerabilities 
make full control over the system’s behaviour 
an illusion [13]. 

3. systems variety: Computer infrastructures have 
a large variety in terms of hardware platforms, 
operating systems and application software. Our 
definition of infrastructures includes not only 
desktops, servers and laptops, but also embedded 
devices such as palmtops, mobile phones and 
network devices such as routers and switches. 
All of these devices run on a variety of operating 
systems and accompanying application software. 


Using a network shell or a configuration manage- 
ment language whose process is close to manual system 
administration simply does not work in large and varied 
computer infrastructures with complex software systems. 
Indeed, the subtle interactions between (different ver- 
sions of) software packages can make systems with the 
same operating system and hardware platform unique. 
According to [5], the cost per unit becomes excessively 
large when using manual management processes. Looser,
higher-level processes are necessary.


PoDIM abstracts from systems variety and al- 
lows, more than existing configuration management 
languages, expressing an administrator’s intentions. 
Expressing intentions is clearly of a higher-level na- 
ture than expressing, for example, what lines in the 
sendmail.cf file need to be modified on a mail relay. 
The key concept of PoDIM’s high-level language is 
that it allows modeling of cross machine constraints. 


The remainder of this article is structured as
follows. First, we introduce PoDIM as a high-level
language for configuration management. Next, we
elaborate on PoDIM in the sections on Configuration
Descriptions and Rules. The run-time semantics of
PoDIM are introduced in the Prototype section. We also
present a case study. We end with sections on related
and future work.


Language Overview 


The state of the art in configuration management 
allows assigning roles to machines and setting high- 
level parameters for those roles. “Configure machine
X to be a web server” and “configure machine Y to
be a DHCP server” are examples of role assignments.
“The web server must run on port 80” is an example
of a parameter assignment. The PoDIM language aims
for a higher level of abstraction. Instead of role
assignments, we want to express things such as: “One of
my servers must be configured as a web server” and “On
every subnet, there must be two DHCP servers.” Instead
of parameter assignments, we want to express
things such as: “A web server must use a port number
higher than 1024.”


PoDIM’s language consists of a rule language 
and a domain model. The distinction between domain 
model and rule language is a recurring theme in policy 
languages. A domain model provides a description of 
the domain in which a rule language solves problems. 
Since we are dealing with the domain of configuration 
management, the domain model contains descriptions 
for things such as DHCP servers and web servers. The 
rule language defines types of rules and how they in- 
teract with the domain model. A system administrator 
writes rules in PoDIM’s rule language. These rules in- 
teract with the domain model and output a configura- 
tion for each managed device. The next section elabo- 
rates on the domain model. Subsequently, we describe 
PoDIM’s rule language. 


The basic principle of PoDIM’s runtime is that 
each real world object is simulated in the system. The 
different classes of objects are defined in the domain 
model. Examples of object classes are devices, net- 
work interfaces and services such as DHCP servers 
and web servers. PoDIM’s runtime takes a set of poli- 
cy rules as input and tries to satisfy these rules by cre- 
ating objects and setting parameters of objects. In do- 
ing so, it generates a configuration for each managed 
device. Existing tools, such as Cfengine [12, 10, 11, 
14], can then be used to deploy the generated configu- 
ration on each real world device. 


Configuration Descriptions 


PoDIM’s domain model is object-oriented. This
means that the “things” in the domain model are
coded as classes. Examples of classes are DHCP server
and web server. We code these classes in an existing
object-oriented programming language, Eiffel
[30, 31].





Figure 1 shows a simplified graphical representation
of the domain model in the BON notation [40].
All classes, such as DHCP server and web server,
have one common ancestor: ENTITY. The ENTITY
class is used for modeling common functionalities.
The arrows denote inheritance relationships. The
simplified example presented in Figure 1 uses only single
inheritance relationships. In real life examples, multiple
inheritance is often necessary. Eiffel supports this.


Classes define the interface of a software subsys- 
tem in PoDIM. A class has attributes that can be set, 
queries that can be executed and commands that can be 
executed. For example, the WEB_SERVER class has
an attribute for setting its port, a query to find out the 
administrative mail address and a command to enable 
php support on the server. Attributes are set by the sys- 
tem administrator when writing rules. Queries are used 
by the system administrator and other objects to gather 
information about the runtime system. Commands are 
used as an inter-object communication mechanism. An 
example of the latter occurs when a webmail object 
commands a web server to enable php support. 


In the rest of this section, we elaborate on the 
definition of classes. We start with attributes and 
queries. Attributes are an object’s data structures. 
Queries define the questions one can ask an object. 
Next, we discuss commands. Commands define how 
objects can change each other’s state. We end this sec- 
tion with a description on how dependencies are mod- 
eled between classes. 


Attributes and Queries 


Attributes define the data structures for objects of
a class. Queries are methods which return a result. In
Eiffel, all attributes are also queries by definition, i.e.,
objects can query each other’s attributes. An object can
only modify another object’s attributes by using
commands. The result of a query is computed based on the
results of other queries or the values of attributes. The
example in Listing 1 shows a partial web server class. It
defines two attributes, “php_supported” and “domain”,
and one query, “administrator_email”. In the example,
the “administrator_email” query is based on the
attribute “domain”. All attributes and queries have a
type. For example, the “php_supported” attribute has
type BOOLEAN. Note that the definition of WEB_SERVER
is used as an illustration, not as an introduction
to a real world WEB_SERVER class.


Figure 1: Domain model in BON [40] notation. All units of functionality such as mail servers and DNS clients are
modeled as classes. All classes have one common ancestor: ENTITY. The arrows denote inheritance relationships.


Commands 


Commands are methods that change the state of
an object (i.e., modify its attributes), but do not return
a result. The example in Listing 1 contains a partial
web server class with one command: “enable_php”.
The “enable_php” command changes the value of the
“php_supported” attribute. Since a command can
contain arbitrary code, its behaviour should be clearly
documented. For this documentation, we use another
feature of Eiffel: preconditions and postconditions.
Preconditions express conditions that need to be true
before the command is executed, while postconditions
express the effects of the command’s execution. In our
web server example, the precondition of the
“enable_php” command is that php support is not yet
enabled. Its postcondition expresses that php support
is enabled once the command has been executed.


It is also possible to control access to commands,
i.e., to prohibit objects from executing commands on
other objects. In the web server example, we could allow
only objects that run on the same device to enable php
support. In the section on authorizing commands, we
elaborate on how to specify access controls for objects.
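
For readers who do not know Eiffel, the split between
attributes, queries and commands, and the contract on
“enable_php” in Listing 1, can be mimicked in Python. This
is only a minimal sketch: the class name WebServer, the use
of plain assert statements in place of Eiffel's require and
ensure clauses, and the example domain are assumptions made
for illustration, not part of PoDIM.

```python
class WebServer:
    """Python analogue of the partial WEB_SERVER class in Listing 1."""

    def __init__(self, domain: str) -> None:
        self.php_supported = False   # attribute
        self.domain = domain         # attribute

    def administrator_email(self) -> str:
        # Query: computed from the "domain" attribute, changes no state.
        return "webmaster@" + self.domain

    def enable_php(self) -> None:
        # Command: changes state and returns nothing.
        assert not self.php_supported, "precondition: php not yet enabled"
        self.php_supported = True
        assert self.php_supported, "postcondition: php enabled"


server = WebServer("example.org")   # hypothetical domain
server.enable_php()
```

Calling enable_php a second time on the same object fails
the precondition, which is the behaviour the contract is
meant to document.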


class WEB_SERVER

feature -- Attributes

    php_supported: BOOLEAN

    domain: STRING

feature -- Queries

    administrator_email: STRING is
        do
            Result := "webmaster@" + domain
        end

feature -- Commands

    enable_php is
        require
            php_supported = False
        do
            php_supported := True
        ensure
            php_supported = True
        end

end


Listing 1: This partial WEB_SERVER class defines two attributes: “php_supported” and “domain”, one query:
“administrator_email” and one command: “enable_php”.


Modeling Dependencies


Many dependencies exist between classes (and
their real world software configurations). Imagine a
startup class that is responsible for generating the
/etc/init.d directory on Linux systems.¹ Both the web
server and DHCP server classes depend on the startup
system. Indeed, if these two network services have no
hook into the startup system, they are not activated when
we reboot a machine. Another example of a dependency
is the relationship between the implementation of a
service and its attributes. Imagine a web server class that
supports two web server implementations: apache and
publicfile. Apache supports php, publicfile does not. In
this case, php support can never be enabled on a web
server object if it uses publicfile as its implementation.


¹ Classes can be used to abstract away from details like the
operating system used and different software versions. For
example, the same startup object results in different files
being generated on Linux and BSD systems.


To make these kinds of dependencies explicit in
our domain model, we use two language constructs of
Eiffel: references and invariants. When an attribute is
declared in a class, it holds a reference to another
object and not the contents of that object.
For example, the dependency of the web server on the
startup system is modeled as a reference in Listing 2
on line 9. Invariants are arbitrary boolean expressions
that are required to be true at all times during an
object’s lifetime. They provide a built-in mechanism for
modeling fine-grained dependencies (and other
restrictions on the state of an object’s attributes). For
example, the relationship between php support and the
chosen implementation is modeled in Listing 2 on line 12.


Making dependencies that exist in an IT
infrastructure explicit in the domain model has two
advantages. First, they can be used as a documentation aid.
Second, a dependency violation can be detected by the
PoDIM runtime. For example, when one rule states
that the attribute “implementation” of a web server
class must be set to “publicfile” and another rule
states that the attribute “php_supported” must be set
to true, a dependency violation is detected. The default
behaviour is to signal an error and abort.
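
The conflict described above can be sketched in Python. The
invariant below mirrors line 12 of Listing 2; the class
WebServer, the helper method assign and the exception
InvariantViolation are hypothetical names introduced for
this illustration, and the real PoDIM runtime may detect the
violation differently.

```python
class InvariantViolation(Exception):
    """Raised when an assignment leaves an object in an invalid state."""


class WebServer:
    def __init__(self) -> None:
        self.implementation = "apache"
        self.php_supported = False

    def invariant(self) -> bool:
        # implementation = "publicfile" implies not php_supported
        return self.implementation != "publicfile" or not self.php_supported

    def assign(self, attribute: str, value) -> None:
        # Hypothetical helper: apply one assignment, then re-check.
        setattr(self, attribute, value)
        if not self.invariant():
            raise InvariantViolation(f"{attribute}={value!r} breaks the invariant")


server = WebServer()
server.assign("implementation", "publicfile")   # first rule
try:
    server.assign("php_supported", True)        # second, conflicting rule
    conflict_detected = False
except InvariantViolation:
    conflict_detected = True
```

In this sketch the second assignment is the one that is
reported, which corresponds to signalling an error and
aborting.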


Rules 


The rule language is the user interface for the 
system administrator. It defines rules for expressing
what the configuration of an infrastructure must look
like. Remember that each real world object is simulated
in PoDIM. For example, there exists an object
for each device in your system, each network interface 
and every service that needs to be configured. The do- 
main model presented in the previous section is a stat- 
ic description of the possible classes that can exist in 
the system. The rule language is used to create and 
manipulate objects. 


A distinguishing feature of our rule language is
that it allows the specification of constraints. We
demonstrate the need for constraints with two examples.²

• When configuring a web server, the port is one
  of the attributes that can be set. A constraint
  allows expressing things such as “the port should
  be set to 80 or a value higher than 1024.” In
  contrast, a regular assignment only allows
  expressing things such as “the port should be set
  to the value 80.”
• Servers typically have roles assigned which
  determine the services they must offer. For example,
  one can state that system X is going to be a
  web and mail server. By using constraints we
  can express things such as: “A device should
  not provide more than four network services,”
  “I want two DHCP servers on each subnet,” or
  “One of my servers should configure itself as a
  web server.”


Since the domain model is object-oriented, the 
rule language contains rules to create objects and 
modify the attributes of objects. Remember that a 
class also defines commands for its objects. Com- 
mands allow objects to change each other’s behaviour. 
The rule language also allows access controls between 
objects to be defined. In the rest of this section, we
elaborate on the three types of rules: creating objects,
modifying objects and authorizing commands.


² Since this work is about configuration management, we
use examples from this domain. Nevertheless, the rule
language is generic enough to apply to other domains.


01 class WEB_SERVER

03 feature -- Attributes

05     implementation: STRING

07     php_supported: BOOLEAN

09     startup: STARTUP

11 invariant
12     implementation.is_equal("publicfile") implies not php_supported
13 end


Listing 2: This partial web server class illustrates the invariant mechanism to model the dependency between the
implementation and php support and the reference mechanism to model the dependency between the web server
and startup system.


Creating Objects


As a system administrator, you configure the network
to offer services. This results in assigning a set of
roles to each device in the network. For example,
machine A acts as a web server and DHCP server.
Machine B acts as a DNS server. All machines act as IPv4
nodes or routers. PoDIM’s creation rules express role
assignments precisely. Since every real world object is
simulated in the PoDIM runtime, rules must exist for all
real world objects to be created. In general, a creation
rule instructs a set of objects to create other objects. For
example, we instruct machine A to create a web server
and a DHCP server. How do we know that a simulated
object of machine A exists in the system? We cannot
assume this, so we have to create it with a creation rule.
But then again, which object needs to create machine
A? To get out of this bootstrapping problem, we assume
the presence of one object of a predefined class called
SYSTEM_ENTITY.


Listing 3 shows a creation rule to create machine
A. The rule contains three parts. The first part, on line
1, specifies the rule type. In this case, we want to write
a creation rule. The rule type can be extended with an
optional rule identifier, in this case “machine_A”. The
second part states which object needs to be created. In
this case, we want to create a DEVICE object (line 2).
On line 3, we also specify one initial attribute for the
device object that is going to be created: the attribute
“name” will be set to “machine_A”. The third and
last part, on line 4 after the “select” keyword, specifies
which object(s) need to execute this rule. In this
case, we want all objects of class SYSTEM_ENTITY
to execute this rule. By definition, only one
SYSTEM_ENTITY object exists, so this rule will only be
executed by one object. In plain English, the rule in
Listing 3 reads as “A device object with name
machine_A must be created.”


01 creation machine_A
02   DEVICE
03     name "machine_A"
04   select SYSTEM_ENTITY


Listing 3: This creation rule reads as “A device with
name machine_A must be created.”


Now that we can write rules to create objects for
all managed devices, it is time to enable some
functionality on those devices. For example, all devices need
to resolve host names to addresses. Consequently, we
need to enable each machine’s DNS configuration. To
enable this, we assume the presence of a DNS_CLIENT
class in the domain model. Listing 4 shows how to
express that every machine should configure itself as a
DNS client. The rule has “dns_clients” as its identifier
(line 1). In this case, the object that needs to be created
is of class DNS_CLIENT (line 2). The objects that need
to create a DNS_CLIENT object are all objects of
class DEVICE (line 3). Enabling functionality on a
device is thus equivalent to instructing a device to create
an object that represents that functionality (in this case,
a DNS client). In plain English, the rule in Listing 4
reads as “All machines must act as a DNS client.”


01 creation dns_clients
02   DNS_CLIENT
03   select DEVICE


Listing 4: This creation rule reads as “All machines 
must act as a DNS client.” 
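
The way creation rules chain from the single bootstrap
object can be sketched as follows. This is an illustrative
model only: the Entity dataclass and the function
apply_creation_rule are hypothetical names, and the sketch
makes no claim about how the PoDIM prototype implements
rule execution.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    kind: str                                   # class name, e.g. "DEVICE"
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def apply_creation_rule(objects, creates, select, attributes=None):
    """Every object of class `select` creates one `creates` object."""
    created = []
    for obj in [o for o in objects if o.kind == select]:
        child = Entity(creates, dict(attributes or {}))
        obj.children.append(child)
        created.append(child)
    objects.extend(created)
    return created


# Bootstrapping: exactly one SYSTEM_ENTITY exists before any rule runs.
objects = [Entity("SYSTEM_ENTITY")]
# Listing 3: the SYSTEM_ENTITY object creates a DEVICE named machine_A.
apply_creation_rule(objects, "DEVICE", "SYSTEM_ENTITY", {"name": "machine_A"})
# Listing 4: every DEVICE creates a DNS_CLIENT.
apply_creation_rule(objects, "DNS_CLIENT", "DEVICE")
```

After both rules have run, the object list holds the
bootstrap object, one device and the DNS client created by
that device.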


In many cases, a more fine grained mechanism is 
needed to describe which objects need to execute a 
rule. For example, how would you say that all ma- 
chines you use as a server need to configure them- 
selves as a web server? To enable this, the part of a 
rule that selects objects on which to apply the rule can 
be further refined with a boolean expression. This 
boolean expression filters the objects that apply the 
rule. For example, in Listing 5 the selection clause on 
lines 3-4 includes all DEVICE objects, except for 
those where the boolean expression on line 4 evaluates 
to false. In plain English, this rule reads as “Machines
with label ‘server’ act as WEB_SERVER.” Note that
we assume the presence of a “labels” attribute in the
class DEVICE. The contents of this attribute can
easily be set when writing rules that create devices. This
is done in the same way as we assigned the value
“machine_A” to the “name” attribute in Listing 3.


01 creation mail_servers
02   WEB_SERVER
03   select DEVICE
04   where DEVICE.labels.has("server")


Listing 5: This creation rule reads as “Machines with
label ‘server’ act as WEB_SERVER.”
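
The filtering performed by the “where” clause of Listing 5
amounts to applying a predicate to every object of the
selected class. A minimal sketch, assuming a hypothetical
Device dataclass with a set-valued labels attribute and a
hypothetical select helper:

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    labels: set = field(default_factory=set)
    services: list = field(default_factory=list)


def select(objects, where=lambda obj: True):
    """Keep the objects for which the optional `where` predicate holds."""
    return [obj for obj in objects if where(obj)]


devices = [
    Device("machine_A", {"server"}),
    Device("machine_B", {"desktop"}),
    Device("machine_C", {"server", "lab"}),
]

# Listing 5: select DEVICE where DEVICE.labels.has("server")
for device in select(devices, where=lambda d: "server" in d.labels):
    device.services.append("WEB_SERVER")
```

Only machine_A and machine_C receive a web server here;
machine_B is excluded because the predicate evaluates to
false for it.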


The syntax of the select clause (lines 3 and 4 of
Listing 5) is modeled after SQL SELECT statements
[2]. The name of a table (a class name in our case)
follows the “select” keyword. The optional “where”
clause excludes rows (objects conforming to the class
name) for which the boolean expression evaluates to
false. All queries and attributes of a class can be used
in a boolean expression. Operators are used to build
composite expressions. Listing 5 uses the feature
call operator. Other examples of operators are
comparison operators, boolean operators and arithmetic
operators.


In many cases, you want to express not only
what objects need to be created on a DEVICE (or
another object) but also how many need to be created.
This is where creation constraint rules come into the
picture. Listing 6 expresses the previously mentioned
example that “a device should not provide more than
four network services”. Creation constraint rules have
an extra keyword: “constraint”. The name of the class
to be created is also prefixed with an interval. In this
case the interval expresses that a maximum of four
server objects can be created. Note that we are using
the inheritance features from the domain model in this
example. We assume that all types of network services
such as DHCP servers and web servers inherit from
the SERVER class.


01 creation constraint server_objects
02   [ 0 : 4 ] SERVER
03   select DEVICE


Listing 6: This creation constraint rule reads as “A
device should not provide more than 4 network
services.”


Often, you don’t care which DEVICE will be
your web server, as long as one or more devices
are configured as web server. This can be expressed
with the “group by” clause of the SQL SELECT
syntax. The group by clause applies a rule to a group of
objects rather than to single objects. Listing 7 then
reads as “One device with label ‘server’ must act as a
web server.”


01 creation mail_service
02   WEB_SERVER
03   select DEVICE
04   where DEVICE.labels.has("server")
05   group by DEVICE.labels.has("server")


Listing 7: This creation rule reads as “One device with label ‘server’ must act as a web server.”


We end with an often cited example in the context
of configuration management: “I want two DHCP
servers on each subnet.” The rule for this example is
shown in Listing 8. This example combines constraint
rules and rules with “group by” clauses.


01 creation constraint dhcp_servers
02   [ 2 : 2 ] DHCP_SERVER
03   select NETWORK_INTERFACE
04   group by NETWORK_INTERFACE.subnet_interfaces


Listing 8: This creation constraint rule reads as “Each subnet must have two DHCP servers.”


Before we explain the rule itself, we introduce the
NETWORK_INTERFACE class. In the same way as
we can create device, DNS client and web server
objects, we can create objects representing network
interfaces. It does not matter whether an object
represents hardware functionality (such as a device or
network interface) or software functionality (such as a
DNS client or web server). The basic concept is that the
SYSTEM_ENTITY object creates DEVICE objects. DEVICE
objects can be instructed to create other objects such as
DNS_CLIENT or NETWORK_INTERFACE objects. In the
same way, NETWORK_INTERFACE objects can be
instructed to create DHCP_SERVER objects, which is
the functionality demonstrated in Listing 8.


The interval on line 2 fixes the number of
DHCP_SERVER objects at exactly two. The
“subnet_interfaces” query of the NETWORK_INTERFACE
object returns the set of all network interfaces in the
same subnet as the object on which the query is executed.
The result of the “select” clause on lines 3-4 will be a
set of network interface sets. Each inner set represents
one subnet. On each of those inner sets, the rule to create
two DHCP servers is executed, which results in two
DHCP servers on each subnet.
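
The “set of sets” evaluation of Listing 8 can be sketched
in Python. Several things here are assumptions made for the
example: interfaces carry a plain subnet string instead of
a “subnet_interfaces” query, the helper names are
hypothetical, and the sketch simply picks the first
interfaces in each group, whereas the text does not say how
the runtime chooses among the candidates.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class NetworkInterface:
    name: str
    subnet: str
    services: list = field(default_factory=list)


def group_by_subnet(interfaces):
    """Return one group of interfaces per subnet (the "set of sets")."""
    groups = defaultdict(list)
    for interface in interfaces:
        groups[interface.subnet].append(interface)
    return list(groups.values())


def satisfy_interval(group, service, low, high):
    """Create `service` objects in the group until at least `low` exist."""
    count = sum(i.services.count(service) for i in group)
    if count > high:
        raise ValueError("too many " + service + " objects")
    for interface in group:
        if count >= low:
            break
        if service not in interface.services:
            interface.services.append(service)
            count += 1
    if count < low:
        raise ValueError("not enough interfaces for " + service)


interfaces = [
    NetworkInterface("a-eth0", "10.0.1.0/24"),
    NetworkInterface("b-eth0", "10.0.1.0/24"),
    NetworkInterface("c-eth0", "10.0.1.0/24"),
    NetworkInterface("d-eth0", "10.0.2.0/24"),
    NetworkInterface("e-eth0", "10.0.2.0/24"),
]

# Listing 8: [ 2 : 2 ] DHCP_SERVER, grouped per subnet
for group in group_by_subnet(interfaces):
    satisfy_interval(group, "DHCP_SERVER", low=2, high=2)

servers_per_subnet = {
    group[0].subnet: sum("DHCP_SERVER" in i.services for i in group)
    for group in group_by_subnet(interfaces)
}
```

The same satisfy_interval helper with low=0 and high=4
would express the upper bound of Listing 6, where nothing
needs to be created but exceeding four objects is an error.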

Modifying Attributes 

Once roles are assigned to devices, you want to 
tune the behaviour of those roles. Your web server 
needs a port to run on, your DHCP server needs to 
know whether it should serve fixed addresses, your 
DNS client needs to know what its domain is, and so 
forth. These examples can be expressed with PoDIM’s 
attribute assignment rules. They change the value of 
an object’s attributes. 


Let’s start with the simple case: how do we
specify the search domain for our DNS clients? The rule
that realizes this is shown in Listing 9. Rules dealing
with attribute assignments are called assignment rules
(hence the keyword “assignment” on line 1 of Listing
9). In general, an assignment rule consists of a series
of attribute-value assignments that are applied to the
objects in the select clause. In our example, we show
one attribute-value assignment, where the attribute is
“search_domain” and the value is “mydomain.com”.
The objects on which this assignment is applied are, in
this case, all DNS clients.


01 assignment dns_search_domain
02   search_domain "mydomain.com"
03   select DNS_CLIENT


Listing 9: This assignment rule reads as “All DNS
clients have mydomain.com as their search domain.”
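
Applying an assignment rule such as Listing 9 means
setting each attribute-value pair on every selected object.
A minimal sketch, in which the DnsClient dataclass and the
apply_assignment function are hypothetical names chosen for
the example:

```python
from dataclasses import dataclass


@dataclass
class DnsClient:
    host: str
    search_domain: str = ""


def apply_assignment(targets, assignments):
    """Set every attribute/value pair on every selected object."""
    for target in targets:
        for attribute, value in assignments.items():
            if not hasattr(target, attribute):
                raise AttributeError(f"unknown attribute {attribute!r}")
            setattr(target, attribute, value)


clients = [DnsClient("machine_A"), DnsClient("machine_B")]

# Listing 9: search_domain "mydomain.com", select DNS_CLIENT
apply_assignment(clients, {"search_domain": "mydomain.com"})
```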


In some cases, you don’t care what value an
object’s attribute has, as long as it’s within a predefined
range. For example, you might want to express that
“the port of all my web servers should be set to 80 or
a value higher than 1024.” This is where assignment
constraint rules come into the picture. Listing 10
shows the assignment constraint rule for our example.
The attribute to be set is called “port”. The valid
values for this attribute are the union of the singleton 80
and all values greater than 1024.


01 assignment constraint webserver_ports
02   port [ 80 ] ∪ [ 1024 : ∞ ]
03   select WEB_SERVER


Listing 10: This assignment constraint rule reads as
“A web server’s port must be 80 or a value greater
than 1024.”


Authorizing Commands


Many system administrators work in a team. In
most teams, people have roles: Jack is our Linux server
specialist, Greg is our networking guy and Bill is our
desktop guy. In small teams, communication is easy:
Jack, Greg and Bill are located in the same office. In
larger teams, however, there is a need to specify roles
more precisely and enforce them automatically.


Since Bill is our desktop guy, we do not want
him to configure network services of any kind. How
do we express this? Consider the SERVER class. All
network services like DHCP servers and web servers
inherit from this class. The SERVER class thus
represents common functionality for network services. We
want to express that Bill cannot modify the attributes
of an object if it inherits from SERVER. To realize
this, we first introduce two extra PoDIM features: a
rule type to express access controls and support for
writing rules about other rules.


Recall from the discussion of commands that we
wanted to limit access to the “enable_php” command
on a web server to objects that run on the same device
as the web server. To allow this, we introduce a third
type of rule: filter rules. Remember that we already
introduced creation and assignment rules. Listing 11
shows a filter rule for the case where we want to limit
access to the “enable_php” command of web servers.
The filter rule blocks the execution of the
“enable_php” command on a WEB_SERVER for every
ENTITY that is not created by the same device as the
WEB_SERVER.


01 filter php_enabling
02   enable_php block
03   select ENTITY, WEB_SERVER
04   where not ENTITY.device.is_equal(WEB_SERVER.device)


Listing 11: This filter rule reads as “PHP support on web server can only be enabled by objects on the same device.”


Filter rules make it possible to block the execution of a
command based on the caller and callee objects. A
filter rule starts with the “filter” keyword and has an
optional identifier. It contains one or more commands
(with optional arguments) that need to be blocked. The
selection part on lines 3-4 is a bit different from that of
creation and assignment rules. A filter rule always
selects pairs of objects to identify a caller/callee pair. In
SQL terminology: the SELECT clause contains a join
of tables named ENTITY and WEB_SERVER. The
resulting tuple set is then filtered with the “where”
expression on line 4. In this case, we express that we
want to block communications when the caller is any
entity and the callee a WEB_SERVER (line 3). If the
caller executes the “enable_php” command, it is
blocked when the caller is not created by the same
device as the web server.
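
The caller/callee join can be sketched in Python as a
Cartesian product that is then filtered by the “where”
predicate. The Entity dataclass, the blocked_pairs function
and the example objects (a webmail and a wiki object) are
hypothetical and serve only to illustrate the semantics of
Listing 11.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Entity:
    name: str
    device: str            # name of the device that created this object
    kind: str = "ENTITY"


def blocked_pairs(callers, callees, where):
    """Join callers with callees and keep the pairs the filter blocks."""
    return {(caller.name, callee.name)
            for caller, callee in product(callers, callees)
            if where(caller, callee)}


web = Entity("web", device="machine_A", kind="WEB_SERVER")
webmail = Entity("webmail", device="machine_A")
wiki = Entity("wiki", device="machine_B")

# Listing 11: where not ENTITY.device.is_equal(WEB_SERVER.device)
blocked = blocked_pairs(
    callers=[webmail, wiki],
    callees=[web],
    where=lambda caller, callee: caller.device != callee.device,
)


def may_enable_php(caller, callee):
    """A call is allowed unless its caller/callee pair is blocked."""
    return (caller.name, callee.name) not in blocked
```

Here the webmail object shares a device with the web server
and may enable php support, while the wiki object on
another machine is blocked.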


The other feature we need is support for rules about other
rules. When we want to express that Bill cannot con- 
figure network services of any kind, we need a filter 
rule that prohibits the modification of objects repre- 
senting network services. The only way Bill can modi- 
fy objects is to write assignment rules. So, we want to 
write a filter rule that blocks assignment rules written 
by Bill from being applied on network services. Re- 
member that we defined a common class for network 
services in this example, called SERVER. Since a fil- 
ter rule specifies a policy for the interaction between 
two objects and assignment rules are in this case part 
of the interaction, assignment rules themselves must 
be objects. 


Let’s go into more detail on how rules can be
objects themselves. Take for example the assignment
rule from Listing 9. The assignment rule contains a
rule identifier, “dns_search_domain”, an attribute that
needs to be modified, “search_domain”, the value for
that attribute, “mydomain.com”, and the set of target
objects, all DNS_CLIENT objects. Looking at this rule
as an object, we have an object with attributes
“identifier”, “attribute”, “value”, and “targets”. In this
example, the value of “identifier” is “dns_search_domain”,
“attribute” is set to “search_domain”, and so on.
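
Viewed this way, the rule from Listing 9 is just a record
with four fields. A minimal sketch, assuming a hypothetical
AssignmentRule dataclass in which the target set is
represented by the class name from the select clause:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssignmentRule:
    identifier: str
    attribute: str
    value: str
    targets: str          # class name used in the select clause


rule = AssignmentRule(
    identifier="dns_search_domain",
    attribute="search_domain",
    value="mydomain.com",
    targets="DNS_CLIENT",
)

# Because a rule is an ordinary object, other rules can query it.
rules_touching_dns = [r.identifier for r in [rule] if r.targets == "DNS_CLIENT"]
```

Being able to query rule objects like any other object is
what makes it possible to write rules about rules.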


Since rules exist as objects in our system, they 
must have a static definition (class) in the domain 
model. An updated graphical representation of our
domain model is shown in Figure 2. PoDIM’s three types
of rules (creation, assignment and filter) are shown
as classes in the domain model.


We have now introduced all features needed for
expressing that Bill cannot configure network
services. This policy is represented by the filter rule
shown in Listing 12. The filter rule deals with the
interaction between ASSIGNMENT_RULE objects and
SERVER objects. By definition, the
“execute_assignment_rule” command is used to execute an
assignment rule on an object. Since we want to forbid
Bill to configure network services, we must block the
execution of this command for rules created by Bill on
SERVER objects.


01 filter bill_cannot_configure_services
02   execute_assignment_rule(rule) block
03   select ASSIGNMENT_RULE, SERVER
04   where ASSIGNMENT_RULE.creator = "BillsPublicKey"


Listing 12: This filter rule reads as “Bill cannot configure network services.”


Note that we depend on the presence of the
“creator” attribute of an ASSIGNMENT_RULE. This
attribute can be set with another assignment rule. We
will not delve into how we can be sure that identities
cannot be spoofed. For now, it suffices to say that this
can be achieved with public key cryptography and rule
signatures.


Notice from Figure 2 that rule classes have a
common parent: RULE. RULE itself is a child of the
common class MANAGED_OBJECT, as is the ENTITY
class that was discussed previously. In the same
way that you cannot compare a DNS client with a
web server, it is useless to compare rules with entities.
They both have the same structure (they contain
attributes, queries and commands) but they represent
very different things: rules represent intentions on the
part of the system administrators, while entities
represent real world objects such as devices, network
interfaces and network services.


Figure 2: Domain model in BON [40] notation. This
model includes RULE classes. RULE and ENTITY
have a common parent: MANAGED_OBJECT.
The arrows denote inheritance relationships.


Prototype


The prototype described below is available for
testing from http://purl.org/podim/devel. First we
describe the rule resolution process. Next, we describe
how a configuration is deployed on a set of machines.


Rule Resolution


We have seen that the basic principle of PoDIM’s
runtime is that objects are created for each real world
“thing”. We have also discussed how a system
administrator uses creation rules to specify which objects
need to be created. The basic form of a creation rule is
that it states that an object or objects of a particular
class must be created by other objects. For example,
we can assert that all devices must create a DNS client
object. Remember that there is a bootstrapping
problem with this approach. To solve this, we assumed the
presence of one object of a predefined class,
SYSTEM_ENTITY.


The component responsible for creating a SYS- 
TEM_ENTITY object is the translation controller. The 
translation controller contains compiled versions of all 
domain model classes. At startup, it creates a SYS- 
TEM_ENTITY object and then parses one or more pol- 
icy files. Policy files contain one or more creation, as- 
signment or filter rules. Remember that rules them- 
selves are also objects in the system. Thus, the transla- 
tion controller creates an object for each rule. 


At this moment, there is one SYSTEM_ENTITY 
object and an object for each rule. The translation con- 
troller then iterates over all available objects and asks 
them to configure themselves. This configuration process 
is different for RULE and ENTITY objects. RULE ob- 
jects check if there are new objects that conform to their 
selection clause. If there are, they attach themselves to 
those objects. 


The configuration process of an ENTITY object 
starts with checking all attached creation rules. The 
creation rules are sorted by class name. Remember 
that classes represent things such as DHCP servers 
and web servers. For each class name, the intersection 
of all creation constraints is computed. Creation con- 
straints are constraints on the number of objects of 
each class name. If the intersection of all creation con- 
straints is empty, an error is generated. Else, the mini- 
mum number of objects is created to satisfy all cre- 
ation rules. 
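
The intersection of creation constraints described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CreationConstraint` type, its `minimum`/`maximum` bounds (mirroring the `[1-1]` style multiplicity of a creation rule) and the function name are hypothetical and are not taken from the PoDIM prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreationConstraint:
    """Hypothetical stand-in: inclusive bounds on the object count of one class."""
    class_name: str
    minimum: int
    maximum: int


def objects_to_create(constraints):
    """Return {class_name: count} satisfying every attached constraint.

    Raises ValueError when the constraints for one class cannot all hold.
    """
    bounds = {}
    # Group by class name and intersect the [minimum, maximum] intervals.
    for c in sorted(constraints, key=lambda c: c.class_name):
        low, high = bounds.get(c.class_name, (0, float("inf")))
        bounds[c.class_name] = (max(low, c.minimum), min(high, c.maximum))

    counts = {}
    for class_name, (low, high) in bounds.items():
        if low > high:  # empty intersection
            raise ValueError(f"conflicting creation rules for {class_name}")
        counts[class_name] = low  # smallest count that satisfies all rules
    return counts
```

Choosing the lower bound of the intersected interval corresponds to creating the minimum number of objects that satisfies all creation rules.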


Next, all attached assignment rules are checked 
and sorted per attribute name. For each attribute, the 
set of allowable values is computed. If this set is emp- 
ty, an error is generated. Else, the number of elements 
in the set is computed. If there is only one element, the 
attribute’s value can be assigned. Else, one value is 
chosen from the set. The algorithm that chooses one 
value from a set can be redefined in each class. For 
example, the algorithm for choosing a valid port on a 
web server will have to take into account ports chosen 
by other services on the same device. The algorithm 
for choosing a valid IPv4 address from a set will have 
to take into account the network address of its subnet 
and addresses already assigned to other devices on the 
same subnet. 
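
As a rough sketch of this step, the code below intersects the value sets allowed by each assignment rule and then delegates the final choice to a replaceable chooser. All names are hypothetical; in particular, `free_port_chooser` only illustrates how a class might redefine the choice for a port attribute, and it is not the algorithm used by the prototype.

```python
def allowed_values(candidate_sets):
    """Intersect the value sets permitted by each attached assignment rule."""
    sets = [set(s) for s in candidate_sets]
    if not sets:
        return set()
    return set.intersection(*sets)


def assign(candidate_sets, choose=min):
    """Pick a value for one attribute, or raise if the rules conflict."""
    allowed = allowed_values(candidate_sets)
    if not allowed:
        raise ValueError("no value satisfies all assignment rules")
    if len(allowed) == 1:
        return next(iter(allowed))  # only one candidate: assign it directly
    return choose(allowed)          # class-specific choice among candidates


def free_port_chooser(ports_in_use):
    """Build a chooser that skips ports taken by other services on the device."""
    def choose(allowed):
        free = sorted(set(allowed) - set(ports_in_use))
        if not free:
            raise ValueError("all allowed ports are already in use")
        return free[0]
    return choose
```

For example, `assign([{80, 8080, 8443}, {8080, 8443, 9000}], choose=free_port_chooser({8080}))` narrows the candidates to 8080 and 8443 and then skips the port that is already taken.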


After all objects have been asked to configure
themselves for the first time, the whole process is
repeated. In practice, SYSTEM_ENTITY will create a
number of DEVICE objects based on its attached
creation rules. In the next run of the configuration
process, DEVICE objects will create other objects
representing services like DHCP servers and web
servers.


The configuration process continues until all ex- 
isting objects reach a stable state. A stable state for an 
object is defined as follows: all rules attached to the 
object are satisfied. A class can extend the definition 
of a stable state. In the RULE classes, for example, the 
definition of stable state is extended with the require- 
ment that a rule must be attached to all objects satisfy- 
ing its selection clause. For a web server class, the 
definition can be extended with the requirement that 
the port attribute must have a value, even if no rules 
exist that set the port attribute. Determining values for 
attributes for which no rule exists is done by calling an
extra method after the configuration process of each 
object. By default this method contains nothing, but 
objects can redefine it. For example, the web server 
class can define this method to set the port attribute to 
80 if no rules exist for this attribute. 


It is possible that a stable state is never reached.
First, a class definition can be erroneous. The
specification of what constitutes a stable state can be
ill-defined. The extra method that can be defined in each
class for additional configuration can also contain errors
that prevent objects of the class (or other objects) from
reaching a stable state. Second, because of the complex
(multiple) inheritance relationships that can exist
between classes, it is possible that the creation rules
are never satisfied.
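
The repeated configuration rounds, and the risk that they never terminate, can be illustrated with a small fixed-point loop. This is a sketch under assumptions: the `configure`/`is_stable` method names, the toy `Node` class and the `max_rounds` cap are all hypothetical, and the text does not say that the prototype bounds the number of rounds.

```python
def run_until_stable(objects, max_rounds=100):
    """Ask every object to configure itself until all report stability.

    `objects` is a list that configure() calls may extend with new objects.
    Returns the number of rounds used; raises RuntimeError at the cap.
    """
    for round_number in range(1, max_rounds + 1):
        for obj in list(objects):  # copy: configure() may append new objects
            obj.configure(objects)
        if all(obj.is_stable() for obj in objects):
            return round_number
    raise RuntimeError(f"no stable state after {max_rounds} rounds")


class Node:
    """Toy object that must create one child while its depth is positive."""

    def __init__(self, name, depth):
        self.name, self.depth, self.child = name, depth, None

    def configure(self, objects):
        if self.depth > 0 and self.child is None:
            self.child = Node(self.name + ".child", self.depth - 1)
            objects.append(self.child)

    def is_stable(self):
        return self.depth == 0 or self.child is not None
```

Objects created in one round are only asked to configure themselves in the next round, which matches the description of SYSTEM_ENTITY creating DEVICE objects first and DEVICE objects creating services afterwards.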


The enforcement of filter rules is done when ob- 
jects execute methods on each other. Before executing 
a method, the attached filter rules are checked. If a 
block policy exists, the execution is not allowed. 
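
A minimal sketch of this check, modeled loosely on Listing 12, is shown below. The `FilterRule` class, the `execute` helper and the dictionary representation of rules and servers are hypothetical simplifications, and "AlicesPublicKey" in the usage note is an invented value for illustration.

```python
class Blocked(Exception):
    """Raised when a filter rule forbids a command."""


class FilterRule:
    def __init__(self, name, command, predicate):
        self.name = name
        self.command = command      # name of the command to block
        self.predicate = predicate  # (caller, target) -> bool, the "where" part

    def blocks(self, command, caller, target):
        return command == self.command and self.predicate(caller, target)


def execute(command, caller, target, filter_rules, action):
    """Run `action` unless one of the attached filter rules blocks the command."""
    for rule in filter_rules:
        if rule.blocks(command, caller, target):
            raise Blocked(f"{command} blocked by {rule.name}")
    return action(caller, target)


bill_rule = FilterRule(
    "bill_cannot_configure_services",
    "execute_assignment_rule",
    lambda assignment_rule, server: assignment_rule["creator"] == "BillsPublicKey",
)
```

With this rule attached, `execute("execute_assignment_rule", {"creator": "BillsPublicKey"}, server, [bill_rule], action)` raises `Blocked`, while the same call with a rule created under "AlicesPublicKey" proceeds.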


Configuration Deployment 


When the translation controller notices that all 
objects have reached a stable state, the deployment 
process is started. The goal of the deployment process 
is to generate configuration files from the created ob- 
jects and deploy these files on all managed devices. 


The first phase is to output an XML-based repre- 
sentation of the in-memory objects. This is done by 
asking the SYSTEM_ENTITY object to output its con- 
figuration. The SYSTEM_ENTITY object asks all its 
children (which are DEVICE objects) to output their 
configuration. The DEVICE objects in turn, ask their 
children to output their configuration and so on. The re- 
sult is that, for each DEVICE, a tree-structured XML 
profile is created. This profile consists of simple at- 
tribute-value assignments for all attributes and queries 
of an object. The format of this XML representation is
defined in Anderson and Smith’s LISA 2005 paper [8].
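
The recursive output step can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names used here (`attribute`, `name`) are assumptions made for illustration only; the actual profile format is the one defined in [8], which this sketch does not reproduce.

```python
import xml.etree.ElementTree as ET


class ManagedObject:
    """Hypothetical in-memory object with attributes and child objects."""

    def __init__(self, class_name, attributes=None, children=None):
        self.class_name = class_name
        self.attributes = attributes or {}
        self.children = children or []

    def to_xml(self):
        # One element per object, named after its class.
        element = ET.Element(self.class_name)
        for name, value in sorted(self.attributes.items()):
            ET.SubElement(element, "attribute", name=name).text = str(value)
        # Each child is asked in turn to output its own configuration.
        for child in self.children:
            element.append(child.to_xml())
        return element


web = ManagedObject("WEB_SERVER", {"port": 80})
device = ManagedObject("DEVICE", {"name": "www1"}, [web])
profile = ET.tostring(device.to_xml(), encoding="unicode")
```

Calling `to_xml()` on a device yields one tree-structured profile for that device, with nested elements for the services it hosts.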


Next, the XML profiles are used as input for a
template engine which generates configuration files and
associated configuration instructions. Apart from the
configuration files themselves, everything that can be
changed in a system is defined as a configuration
instruction. Examples are: setting permissions and
ownership of files, installing packages, restarting
software services and creating links. The format of
configuration instructions is XML-based and is derived
from the internal XML format that Bcfg2 [18, 20, 19]
uses. The grammar of the format can be found at
http://purl.org/podim/devel. The configuration
instructions are then translated to the language used by
one of the deployment backends. The prototype allows
multiple deployment backends to be used. For example, it
is possible to translate configuration instructions to
Cfengine [12, 10, 11, 14], Bcfg2 [18, 20, 19] or LCFG
[7, 3, 6] specifications. It is also possible to add
additional deployment backends.


268 21st Large Installation System Administration Conference (LISA ’07)


Delaet & Joosen


Case Study 


To validate our system, we use the IPv4 address- 
ing policies for the Computer Science Department of 
the K. U. Leuven (CSNet). CSNet has a total of 600 
machines in about 20 subnets. The 134.58.39.0 - 134.58.47.255
block of addresses is assigned to CSNet. CSNet
has two connections to the university-wide network. 
The main connection is a subnet that contains, besides 
the external router for CSNet, switches from other de- 
partments and a router that connects to the main K. U. 
Leuven backbone. One lab is connected to the private 
network of the K. U. Leuven. We want to assign static 
addresses to all network interfaces. Some subnets need 
private addresses. Private addresses are used by the lab 
networks and the network of the departmental admini- 
stration, since this is a Windows network which is 
safer behind a NAT device. 


Besides classes for modeling devices and net- 
work interfaces, we need a class to model a static IPv4 
address configuration. This class is shown in Listing 
13. The class contains a reference to a network inter- 
face (line 6), the value for its address (line 9) and its 
network (line 12). The class also defines a query that 
returns the netmask (lines 17-20). 


01 class
02   NETWORK_IPV4_STATIC_ADDRESS
03
04 feature -- Attributes
05
06   interface: NETWORK_INTERFACE
07     -- attached interface
08
09   address: IPV4_ADDRESS
10     -- IPv4 address
11
12   network: IPV4_NETWORK
13     -- subnet configuration
14
15 feature -- Queries
16
17   netmask: INTEGER is
18     do
19       Result := network.netmask
20     end
21
22 end


Listing 13: Class definition of a static IPv4 address. 





The IPv4 addressing policies for CSNet are described in
Listing 14. Because of space limitations, we omitted the
creation of the device and network interface objects that
represent the hardware configuration of our
infrastructure. Notice that, except for a few corner cases
(the networks providing external access), all devices are
managed with the first three constraint rules: one
creation constraint rule that creates a static IPv4
address configuration and two rules for configuring the
private and public address space. In plain English, the
rules in Listing 14 read as follows.

1. Rule on lines 3-8: All network interfaces must
have one static IPv4 address configured, except
for the interface of “jasje” on the external
access subnet (KULEUVENNET). “Jasje” is our
network sniffer.

2. Rule on lines 10-14: All network interfaces 
that must be reachable from the Internet must 
have an IPv4 address in the range 134.58.39.0 - 
134.58.47.255. 

3. Rule on lines 16-20: All network interfaces on
private subnets must have an IPv4 address in
the range 192.168.0.0 - 192.168.255.255.

4. Rule on lines 22-26: The network interfaces on 
the PC_KLAS subnet must have an IPv4 ad- 
dress on the 10.2.15.0/24 subnet. 

5. Rule on lines 28-32: The access switch on the 
PC_KLAS subnet must have the 10.2.15.254 
address. 

6. Rule on lines 34-38: All interfaces in the external
access subnet (KULEUVENNET) must have an IPv4 address
on the 134.58.254.64/29 subnet.

7. Rule on lines 40-45: The gateway of the exter- 
nal access subnet must have the 134.58.254.70 
address. 
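
The address constraints in this list can be checked mechanically. The sketch below uses Python's standard `ipaddress` module rather than PoDIM's own IPV4_ADDRESS and IPV4_NETWORK classes; the helper names `in_range` and `in_subnet` are hypothetical.

```python
import ipaddress


def in_range(address, first, last):
    """True if `address` lies in the inclusive range first..last."""
    a = ipaddress.IPv4Address(address)
    return ipaddress.IPv4Address(first) <= a <= ipaddress.IPv4Address(last)


def in_subnet(address, cidr):
    """True if `address` belongs to the network written in CIDR notation."""
    return ipaddress.IPv4Address(address) in ipaddress.IPv4Network(cidr)


# Rules 5 and 7 name fixed addresses that must fall inside the
# subnets required by rules 4 and 6 respectively.
switch_ok = in_subnet("10.2.15.254", "10.2.15.0/24")
gateway_ok = in_subnet("134.58.254.70", "134.58.254.64/29")

# The public block of rule 2 is not a single CIDR prefix, so it is
# expressed here as an inclusive range and then summarized.
public_blocks = list(ipaddress.summarize_address_range(
    ipaddress.IPv4Address("134.58.39.0"),
    ipaddress.IPv4Address("134.58.47.255"),
))
```

Both fixed addresses are consistent with their subnets, and the public range decomposes into 134.58.39.0/24 plus 134.58.40.0/21, which is why a range constraint is a natural way to write rule 2.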


As discussed previously, the translation controller
reads the policy rules and tries to find a stable state.
If it succeeds, configuration files are generated by the
template engine. A template file is chosen based on the
operating system of a device. This template then generates
configuration files. For example, for OpenBSD devices,
/etc/hostname.xxx files are generated. For Debian
GNU/Linux devices, /etc/network/interfaces is generated.
On Cisco routers and switches, one global configuration
file is generated. Depending on the mechanics of the
chosen deployment engine (Cfengine, Bcfg2, LCFG, ...),
configuration files are then transported to and deployed
on a device.
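
Template selection by operating system might look roughly like the following. This is a hedged sketch, not PoDIM's template engine: the `TEMPLATES` table, the `render` function and the interface name `em0` are hypothetical, and the file bodies are simplified forms of the OpenBSD and Debian static address syntax.

```python
from string import Template

# Hypothetical templates: (target path pattern, file body) per operating system.
TEMPLATES = {
    "openbsd": (
        "/etc/hostname.$ifname",
        Template("inet $address $netmask\n"),
    ),
    "debian": (
        "/etc/network/interfaces",
        Template(
            "auto $ifname\n"
            "iface $ifname inet static\n"
            "    address $address\n"
            "    netmask $netmask\n"
        ),
    ),
}


def render(os_name, **profile):
    """Return (path, content) for one interface on the given operating system."""
    path_pattern, body = TEMPLATES[os_name]
    return Template(path_pattern).substitute(profile), body.substitute(profile)
```

For instance, `render("openbsd", ifname="em0", address="10.2.15.254", netmask="255.255.255.0")` produces the path `/etc/hostname.em0` and a one-line body, while the same profile rendered for `"debian"` targets `/etc/network/interfaces`.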


Related Work 


Related work of PoDIM’s high-level configuration 
language includes configuration management tools. We 
also discuss how generic policy languages and a model 
finder are related to the problem PoDIM is trying to 
solve. We end this discussion with a characterization of 
application deployment frameworks. 


Configuration Management Tools 


Bcfg2 [18, 20, 19], Cfengine [12, 10, 11, 14],
LCFG [7, 3, 6] and Puppet [26, 27] are the most cited
configuration management tools. As discussed previously,
these tools can be used as a deployment backend for
PoDIM. Bcfg2 and Cfengine are primarily deployment
engines. LCFG and Puppet include capabilities for
modeling dependencies between configurations.


Other related work in the context of configuration 
management includes the work of Couch on closures 
[16]. Closures are defined as functional units that can ac- 
cept commands from the user or other closures. Their in- 
ternal mechanics are hidden. The classes from PoDIM’s 
domain model can be seen as closures. Classes define 
commands that change the behaviour of their objects, 
but can also have queries and attributes. 


01 -- IPv4 Addressing
02
03 creation
04 -- Each interface has 1 IPv4 address, except "jasje"
05 [1-1] NETWORK_IPV4_STATIC_ADDRESS
06 select NETWORK_INTERFACE
07 where NETWORK_INTERFACE.device.name /= "jasje" and
08   NETWORK_INTERFACE.labels.has("KULEUVENNET")
09
10 assignment constraint
11 -- Public address space
12 address [!!IPV4_ADDRESS.make("134.58.39.0"):!!IPV4_ADDRESS.make("134.58.47.255")]
13 select NETWORK_IPV4_STATIC_ADDRESS
14 where NETWORK_IPV4_STATIC_ADDRESS.interface.labels.has("PUBLIC_SUBNET")
15
16 assignment constraint
17 -- Private address space
18 address [!!IPV4_ADDRESS.make("192.168.0.0"):!!IPV4_ADDRESS.make("192.168.255.255")]
19 select NETWORK_IPV4_STATIC_ADDRESS
20 where NETWORK_IPV4_STATIC_ADDRESS.interface.labels.has("PRIVATE_SUBNET")
21
22 assignment
23 -- PC_KLAS IPv4 Address range
24 network !!IPV4_NETWORK.make("10.2.15.0",24)
25 select NETWORK_IPV4_STATIC_ADDRESS
26 where NETWORK_IPV4_STATIC_ADDRESS.interface.labels.has("PC_KLAS")
27
28 assignment
29 -- PC_KLAS external router
30 address !!IPV4_ADDRESS.make("10.2.15.254")
31 select NETWORK_IPV4_STATIC_ADDRESS
32 where NETWORK_IPV4_STATIC_ADDRESS.interface.device.name = "lswitch-cw"
33
34 assignment
35 -- KULEUVENNET external access
36 subnet !!IPV4_NETWORK.make("134.58.254.64",29)
37 select NETWORK_IPV4_STATIC_ADDRESS
38 where NETWORK_IPV4_STATIC_ADDRESS.interface.labels.has("KULEUVENNET")
39
40 assignment
41 -- Gateway configuration for KULEUVENNET
42 address !!IPV4_ADDRESS.make("134.58.254.70")
43 select NETWORK_IPV4_STATIC_ADDRESS
44 where NETWORK_IPV4_STATIC_ADDRESS.interface.device.name = "default-gateway" and
45   NETWORK_IPV4_STATIC_ADDRESS.interface.labels.has("KULEUVENNET")


Listing 14: Policy Specification for the network configuration of K. U. Leuven’s CS department.


Policy Languages
PoDIM separates the domain model and the policy
specification language. Many other policy languages use
this separation. It allows for reuse of both the policy
specification language and the domain model in different
contexts. The PCIM [33, 32] (Policy Core Information
Model) and CIM (Common Information Model) [15] initiatives
define a generic model for representing policy
specifications on one side and a set of domain classes on
the other side. The generic model defines policy rules in
a Condition-Action format. The domain model includes
definitions for common network functionalities such as
routing protocols, network configurations and IPSec
configurations. The domain model itself is object-oriented
and models relations between classes. The CIM domain
model provides a valuable repository of existing domain
knowledge, modeled as object-oriented classes. The CIM
model is very similar to the PoDIM domain model. However,
it has no support for specifying fine-grained
dependencies. PoDIM uses Eiffel invariants for this.
PCIM/CIM also does not support constraint handling.


Other frequently cited policy languages such as Ponder
[17] and JRules [23] also offer an extensible domain
model. The domain model of these languages is
object-oriented, as is the case with the PoDIM domain
model. Policy rules are Event-Condition-Action based in
these languages. The action that can be executed is an
arbitrary operation of the domain model. Notice that this
differs from our approach. Our approach is less expressive
in the sense that we do not allow the execution of
arbitrary operations. However, we do allow constraints on
attributes to be modeled, which is something that is not
supported by the current generation of policy languages.

Model Finding 


In [35] a model finding approach is proposed for
configuration management based on Alloy [24, 25, 1]. This
approach is based on creating a model for an
infrastructure based on first-order logic. Based on a
number of inputs (such as the number of devices and
network interfaces), an outcome is constructed that
satisfies the model. The advantage of using a tool such as
Alloy is that it allows very advanced reasoning over a
configuration. The same model can be used to generate and
validate configurations. The limitations are that
constraints, as we discussed them in this paper, cannot be
modeled. It is also difficult to set specific attributes
of “things”. For example, it is difficult to set a human
readable name for every device.


Application Deployment Frameworks 


Application deployment frameworks like SmartFrog [29, 28],
Spring [38] and JBoss Microcontainer [22] manage
applications directly by tuning their parameters. Notice
the difference with PoDIM: in these frameworks, classes
directly control real world things, while PoDIM classes
represent real world things that are deployed when a
stable state is reached. Because of this difference, the
application deployment frameworks listed above do not
provide constraint resolution of the kind that PoDIM uses.


Future Work 


There are various areas where PoDIM could be im- 
proved. We already mentioned the stable state problem: 
in some cases, it is impossible to reach a stable state and, 
as a consequence, generate a valid configuration. It is al- 
so possible that, on different runs of the translation con- 
troller, different configuration files are generated for the 
same input rule sets. This is because of the possibility 
that classes define random choices for choosing a value 
from a constraint set. Including information about previ- 
ous runs could solve this problem. 


A dry-run or analysis mode would also be useful.
Currently, the prototype supports only the actual
translation and deployment process. Checking the validity
of a rule set requires a simulation mode. Ponder, for
example, uses Event Calculus [9] to do this.


We have not yet gathered data on the scalability
of our prototype. In its current implementation, the
translation from rules to configuration files is done on
one central component. We are currently working on a
decentralized version of the translation controller. In
the decentralized version, each device can be made
responsible for generating its own configuration and will
need to communicate with other devices to find a
globally valid state.


Other areas for improvement deal with usability.
• Extending the domain model requires knowledge of
  Eiffel. In many cases, a less expressive (and easier)
  notation suffices for creating new classes in the domain
  model. A program can then translate classes to the
  Eiffel notation.
• There is no support to track configurations throughout
  the translation process. Therefore, it is impossible to
  know which rules influence the generation of a
  configuration file or which changes in a configuration
  file are caused by a change in one of the rules. Both
  would be useful for debugging rules. In general, the
  translation controller must be able to explain why a
  configuration is generated.
• The rule language contains no structuring mechanisms.
  Working with large policy sets becomes cumbersome.
  Existing preprocessor systems can solve part of this
  problem, but specifying meta-rules will stay
  inconvenient. Instead of looking at PoDIM’s rule
  language as a user interface, it is better to see it as
  a target format for more advanced interfaces, which
  could be, depending on the case, text-based command-line
  programs, graphical user interfaces, web interfaces, and
  so on.


Conclusion 


PoDIM tries to address the complexity of configuration
management by abstracting from system variety and
providing mechanisms for specifying cross-machine
constraints. The PoDIM language consists of a rule
language and an extensible domain model. The current
prototype translates high-level rules into low-level
configuration files and subsequently uses existing
configuration management tools to deploy the generated
configuration on all managed devices. We believe that
PoDIM advances the state of the art. It provides a
higher-level specification than what is currently
available, one that includes cross-machine constraints
and abstracts away system variety.


ABSTRACT


Network patterns are based on generic algorithms that execute on tree-based overlays. A set 
of such patterns has been developed at KTH to support distributed monitoring in networks with 
non-trivial topologies. We consider the use of this approach in logical peer networks in cfengine as 
a way of scaling aggregation of data to large organizations. Use of ‘deep’ network structures can 
lead to temporal anomalies. We show how to minimize temporal fragmentation during data aggre- 
gation by using time offsets and what effect these choices might have on power consumption. We 
offer proof of concept for this technology to initiate either multicast or inverse multicast pulses
through sensor networks.


Introduction 


In this paper we consider an approach for scaling 
data dissemination (e.g., for configuration manage- 
ment) or alternatively for scaling data aggregation 
(e.g., for monitoring or archiving) by implementing 
Network Patterns on top of cfengine’s pull-based copy 
methods. This follows up preliminary work on scaling 
in [1, 2] and Voluntary Cooperation [3] and is inspired 
by work on the Generic Aggregation Protocol (GAP) 
described in [4, 5, 6]. 

Consider the sharing of load in a multicast process by
handing off parts of a task to decentralized processing.
For example, in a distributed backup scheme, one could
imagine assigning responsibilities such that local nodes
collected and compressed their own data before passing
them from the leaves of a tree to their parent node; the
parent would then aggregate data from all of its children
and add its own data, and so on up the tree to a final
repository. By introducing several tree levels one reduces
the total computational burden on the final host. Such a
strategy could be useful in a fixed infrastructure network
(where nodes have limited computational power) and
especially in battery powered processors such as wireless
ad hoc devices, sensor networks, and so on.


Network Patterns are based on generic, distrib- 
uted algorithms that execute on spanning trees, de- 
signed to collate information from a topologically con- 
strained network, such as a fixed routing infrastructure 
or ad-hoc substrate. They employ the basic structures 
used in routing and switching, like spanning trees, and 
can adapt to node or link failures [4, 5, 6]. We shall 
consider only aggregation algorithms here, where ag- 
gregates of local variables across a domain of local de- 
vices are computed using functions, such as sum, max, 
or average. 
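
One practical point about such aggregation functions is that an average cannot be combined level by level from child averages alone; each node has to pass up a partial aggregate. The sketch below illustrates this with a hypothetical `Partial` record carrying a sum, a count and a maximum. It is not taken from the pattern implementations cited in [4, 5, 6].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    """Partial aggregate that a node can merge with those of its children."""
    total: float
    count: int
    maximum: float

    @classmethod
    def of(cls, value):
        # Partial aggregate for a single local variable.
        return cls(total=value, count=1, maximum=value)

    def merge(self, other):
        return Partial(
            self.total + other.total,
            self.count + other.count,
            max(self.maximum, other.maximum),
        )

    @property
    def average(self):
        return self.total / self.count


# Two leaves report 2 and 4 to a parent whose own value is 9.
combined = Partial.of(2).merge(Partial.of(4)).merge(Partial.of(9))
```

Here `combined.average` is 5.0, whereas averaging the children's average (3) with the parent's value (9) would give 6. Sum and max, by contrast, merge directly.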


The overlay networks are usually created under 
some basic physical constraints such as geography, 


physical network design, allowed access, or even by 
wireless power limitations in an ad hoc network. In oth- 
er words, certain branches and levels in the tree could 
be forced into the final topology by physical circum- 
stances, hence one could not merely choose the sim- 
plest star topology for the task, even if it were not an 
unacceptable burden on the single bottleneck. However, 
we can also ask whether it makes sense to build such 
structures even where there are no constraints, such as 
local area networks with underlying star topology. 
There are valid resource sharing reasons for doing this 
in system administration, especially where resources 
are limited. 


Network patterns allow a kind of load balancing, but they
differ from the kind of service balancer that one might
use on a web server. A traditional load-sharing dispatcher
acts like a switch, taking a single input stream and
offloading it to a separate queue; in a network pattern,
data are sent to all branches, like a “smart” multi-port
repeater or amplifier/aggregator.


Inter-Domain Management and Voluntary Cooperation


A subject that is increasingly discussed in to- 
day’s world of cooperative outsourcing is the issue of 
inter-domain management. In the extreme case, each 
node in a network is in its own administrative domain 
(this is approximately true for border routers, for ex- 
ample, as well as hand-held devices). Inter-domain 
management involves many issues that are often ig- 
nored in discussions of system administration. For ex- 
ample, we do not typically have privileged access to 
all of the devices we communicate with. The concept 
of Voluntary Cooperation was introduced to discuss
“minimal trust” interactions with autonomous domains [3].

Even a wireless ad hoc network of personal electronic
devices (or a military network deployed in the field)
could be formed from many devices with different
privileges and privacy policies. Traditional models of
centralized control do not begin to address these issues.


Network Patterns in Cfengine and Scalable Data Aggregation


Monitoring (data collection) from a network of 
sensors (either in a fixed infrastructure net or in a 
wireless environment) is an application that has re- 
ceived a lot of attention. This is because “network 
management” has traditionally been about watching 
network traffic data. Even today as vendors advocate 
the virtues of autonomic computing, network man- 
agers still want to watch the automation in progress. 
Thus the problem of distributed aggregation with un- 
clear domain relationships is still at the heart of net- 
work management. 


Cfengine is a management system that represents 
state of the art research on integrating monitoring and 
reactive (“autonomic”) management of computers. In- 
tegrating network patterns into cfengine would allow 
distributed monitoring and management of a manifestly 
autonomic system with any chosen degree of central- 
ization or decentralization. Cfengine is designed to be 
able to work in mobile, partially connected environ- 
ments. It is an ideal testbed for exploring the usefulness 
of patterns in host based system administration. More- 
over, eventually it is expected that cfengine will be able 
to manage routers and switches for which patterns were 
originally envisaged. 

Network Patterns are not generic routing or switching
structures, although they share similarities. They are
designed to execute any computation whose data can be
represented on the underlying graph. This typically
involves aggregation, dissemination, maximization or
minimization, etc. Here we use them only for the simplest
aggregation of data from every node in a network to an
arbitrary but central place. They are therefore used to
initiate either multicast or inverse multicast pulses
through sensor networks.


Figure 1: Depth and width in network patterns formed from promises.


Any collection of “sensor devices” that can run on a
GNU/Linux platform could use cfengine in the way we
demonstrate here, and this accounts for an ever increasing
number of devices available today. One application is for
collecting and correlating data from around a network from
cfengine’s own sensor component cfenvd. Cfengine’s
investment in methods of voluntary cooperation means that
one need not give away privileges, thereby risking or
sacrificing security, in order to implement patterns. This
makes monitoring of large and fragmented organizations an
easier process to swallow for security officers (the
alternative being to open firewalls to unspecified network
pushes). Increasingly, companies are outsourcing their
systems into different formal domains with their own
policies and barriers. The fact that one can make patterns
work with voluntary cooperation is therefore itself a
valuable proof of concept.


A natural application for this kind of process is 
for monitoring grid systems. These are systems that 
are often geographically distributed and already form 
part of some organized structure. Patterns at the level 
of host based monitoring would allow grid administra- 
tors to view the performance characteristics of the 
component systems or even aggregate results from 
them with controllable accuracy. 


There are various other applications for data
aggregation to a point. Another one is to perform a
distributed backup, collecting and compressing data as
they propagate up the tree. This would offload the
burden of performing the compression, and data could
be encrypted with local keys before compression. We
shall not elaborate on these applications here, but
simply present these tests as proof of the concept.


Some Patterns 


The patterns discussed here are dissemination and
aggregation algorithms that bridge the worlds of
centralized monitoring and fully distributed monitoring.
They are built from “component” pieces that represent the
extreme cases of any network structure: the chain (for
maximum depth) and the star topology (for maximum width);
see Figure 1.






val(D) val(E) 





val(F) val(G) 





agg(C) = 
val(F)+val(G) 
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Trees are structures that bridge these two ex- 
tremes. We can characterize patterns by their depth 
and breadth. Note that a chain is also geometrically a 
half-ring, so it gives us a basic model for ring-topolo- 
gies also. 


Here we consider only two patterns: Echo and 
GAP and consider how these can be implemented in 
cfengine using existing context awareness within the 
system. 


Echo 


The simplest example of a network pattern is the
echo pattern [7, 8]. During its execution, echo creates
a spanning tree topology, with the root of the tree chosen
by an administrator. The pattern has two phases of
communication: expansion and contraction (see Figure 2).
During the expansion phase, the root node issues
a query to its children. Each node in the tree repeats
this process. The contraction begins as the query
reaches a leaf node. The leaf node answers the query,
sending its response to its parent in the tree. The parent
receives the responses of its children, aggregates or
calculates information for the query to the fullest extent
possible, and then sends a single aggregate answer
to its own parent. This process is repeated recursively
until the root node is reached, which aggregates the
messages from its children. The tree topology provides
for parallelized execution, while the aggregation
of query responses during contraction reduces the
amount of traffic that would otherwise be necessary.
The echo pattern therefore forms a wave, spreading
out from the root to the edge of the network and back,
collecting data as it progresses. Echo is intrinsically a
“push” protocol, and is easily understood as a recursive
descent parser.


Figure 2: The expansion and contraction steps in the echo pattern. The pattern is a sequence of “pulls” initiated by
“pushed” signals. (Figure label: agg(A) = agg(B)+agg(C)+val(A).)
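
The two phases can be sketched as a recursive function: the call descending into the children plays the role of the expansion, and the returned sum plays the role of the contraction. This is only an illustrative sketch, not cfengine code. The tree, the node values and the names CHILDREN, VALUES and echo are assumptions made for the example, and every node is assumed to add its own value to the aggregate of its children.

```python
# Hypothetical spanning tree and per-node values, chosen for illustration.
CHILDREN = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
VALUES = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}


def echo(node, children=CHILDREN, values=VALUES):
    """Return the aggregate for the subtree rooted at `node`.

    Expansion: the query is passed on to every child (the recursive call).
    Contraction: each node returns one aggregate value to its parent.
    """
    partials = [echo(child, children, values) for child in children.get(node, [])]
    return values[node] + sum(partials)


if __name__ == "__main__":
    print(echo("A"))  # aggregate collected at the root
```

A leaf simply returns its own value, so the recursion ends at the edge of the tree, which mirrors the point at which contraction begins.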


GAP 


Similar to Echo, the Generic Aggregation Protocol
(GAP) creates a spanning tree along which communication
and computation take place. Unlike echo,
however, data in GAP are passed from the leaves of
the tree towards the root whenever the local variable
in one of the nodes changes. Updates to monitored aggregates
can thus be initiated by any node, not only by
the root node. Thus GAP, once initialized, responds to
local events rather than initiating measurements from
a central observer.


GAP is an asynchronous distributed protocol that
builds and maintains a Breadth First Search (BFS)
spanning tree over which aggregates are computed
incrementally and continuously. The tree is maintained
in a similar way to the algorithm underlying the
802.1d Spanning Tree Protocol (STP). In GAP, each
node maintains a table of its peers, and especially its
nearest neighbours, along with an estimate of the
nodes’ aggregate values. GAP is event driven, in the
sense that each update from a leaf node triggers a cascade
of events through the tree branches, updating the
local aggregates as it goes. Update events can be triggered
by changes in topology, loss of a node, a timer,
etc.


The advantage of GAP over echo is that there is
no “push phase” required to initiate a reading of the
values from the network. As each change occurs in the
network, new values can be percolated back to the
centralized root node, initiating an update only in those
tree nodes that are on the path to the root. This avoids the
need for much unnecessary traffic and computation
during updates.
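
The following sketch illustrates the idea that a local change only touches the nodes on the path to the root. It is a simplified illustration and not the GAP protocol itself: there is no neighbour discovery, no failure handling and no message passing, and the class name GapNode and its methods are hypothetical.

```python
class GapNode:
    """A tree node that keeps the last aggregate reported by each child."""

    def __init__(self, name, value=0, parent=None):
        self.name = name
        self.value = value
        self.parent = parent
        self.child_aggregates = {}  # child name -> last reported aggregate
        if parent is not None:
            self._report([])  # announce the initial value up the tree

    def aggregate(self):
        """Local value plus the stored aggregates of the children."""
        return self.value + sum(self.child_aggregates.values())

    def set_value(self, value):
        """A local event: update the value and percolate towards the root.

        Returns the names of the nodes that were touched by the update.
        """
        self.value = value
        return self._report([])

    def _report(self, touched):
        touched.append(self.name)
        if self.parent is not None:
            self.parent.child_aggregates[self.name] = self.aggregate()
            self.parent._report(touched)
        return touched


if __name__ == "__main__":
    root = GapNode("A", 1)
    b = GapNode("B", 2, root)
    c = GapNode("C", 3, root)
    f = GapNode("F", 6, c)
    g = GapNode("G", 7, c)
    print(f.set_value(10), root.aggregate())
```

In this example a change at F touches F, C and A only; the sibling branch under B does no work, which is the saving described in the text.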


The Topology Manager 


A key feature of the patterns above is the algorithm
by which the topology of the spanning tree is
decided. The GAP algorithm incorporates the topology
adjustment mechanism into the GAP aggregation
algorithm, by using nearest neighbour communications,
hence combining these features into a robust protocol.
However, they can also be separated. The GoCast algorithm
finds such a spanning tree, for example. At this
stage of the work we shall not attempt to encode automated
topology management, as this requires additional
subsystems. Rather we consider how patterns can be
used at the logical level for distributed load balancing,
using existing mechanisms within cfengine. We note
however that cfengine has implemented peer neighbour
management functions for some time in the form of the

   SelectPartitionNeighbours
   SelectPartitionLeader


functions. These functions take a flat list of all known
hosts and partition this list into clusters of a specified
size. Each cluster is assigned an identified leader
which can be used to single out a root or responsible
node for each group, and in this way any host can
autonomously be made aware of its nearest neighbour
topology based only on the shared information of the
flat list. These functions, or functions like them, could
in principle be used to provide an implementation of
GAP topology management in future, for automatic
adaptation for fault tolerance. However, we shall
not pursue the details of the topology here, since it
turns out that the implementation of patterns throws
up a number of issues that are more fundamental.
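
The partitioning idea can be sketched as follows. This is a hedged illustration of the principle described above, not the algorithm used by cfengine: the function names are hypothetical, clusters are assumed to be consecutive slices of the list, and the leader is assumed to be the first entry of each cluster. The point is that every host can compute the same answer from the shared flat list alone.

```python
def partition(hosts, size):
    """Split an ordered host list into consecutive clusters of `size`."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def neighbours_and_leader(hosts, size, me):
    """Return (leader, neighbours) for host `me`.

    Assumption for this sketch: the first host of a cluster is its leader.
    """
    for cluster in partition(hosts, size):
        if me in cluster:
            leader = cluster[0]
            return leader, [h for h in cluster if h != me]
    raise ValueError(f"{me} is not in the host list")


if __name__ == "__main__":
    hosts = [f"node{i}" for i in range(1, 8)]
    print(neighbours_and_leader(hosts, 3, "node5"))
```

Because no communication is needed to agree on the clusters, a scheme of this kind fits the autonomy of the agents; what it does not provide is any reaction to failed nodes.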


The problem of building soft-overlays for com- 
putational load sharing is slightly different to the prob- 
lem of finding a spanning tree through a physically re- 
dundant topology however. In principle, any kind of 
overlay could be built in software, but physical con- 
straints can limit the potential optimizations. What we 
find interesting in a cfengine environment is that we 
must deal with a combination of these issues. If 
cfengine is used in a simple star network, any kind of 
overlay can be built. However, if it is used for inter- 
domain management, or between zones with different 
administrative regimes, then these amount to essential- 
ly physical constraints. 


Cfengine Principles and Patterns 


By implementing network patterns in cfengine we
hope to achieve two things: i) an efficient way of aggregating
data for centralized analysis and decision-making,
and ii) an opening for the load sharing optimizations
that are possible with patterns. An obvious goal for
centralized decision-making would be to use this to
build an “autonomic nervous system” from cfengine’s
autonomous agents so that centralized monitoring and
decision-making can be added to its local stimulus-response
approach to management. Although many configuration
management schemes boast “centralization”,
this can often be seen as a weakness, as it is a
clear limitation on scalability, and such systems usually
only disseminate data from a centralized source: we
are advocating stimulus-response in a distributed system,
something like a central nervous system. While
individual machines work autonomously, we collect,
process and return data to the nodes on a continuous
basis.


The desired model is not without its own chal- 
lenges however: cfengine maintains strong principles 
of autonomy that are largely responsible for its record 
of security and reliability. The challenge is to imple- 
ment aggregation/dissemination patterns without sac- 
rificing those strong principles. 


A cfengine host is, by default, a completely au- 
tonomous entity with no obligations towards other 
agents in a physical network. Every node is therefore 
individual and is not part of a pattern a priori. Leaf 
nodes cannot initiate a push of new data in response to 
events, because the parent node does not accept data 
from any outside source, unless it explicitly pulls the 
data itself. To use patterns as a form of inter-peer col- 
laboration, we must encode them as policy rules that 
are compatible with cfengine’s pull-only principle of 
communication. There are several questions to be an- 
swered about this: 

• Is the underlying physical network topology
important in building a logical load sharing
topology?

• How will the topology be decided?

• How will the topology respond to the failure of
nodes?


We shall not be able to answer all of these questions,
but we present the basic approach to building
GAP-like patterns using cfengine’s internal mechanisms,
and provide tools for readers to experiment on
their own.


Periodic Execution 


Cfengine is normally used for regularly (periodically)
scheduled maintenance sweeps, yet the traditional
idea of a network probe is to ask a question and
get back the answer on demand (as with probes like
ping and traceroute). The Echo pattern is a “push-me
pull-you” strategy for connecting to all elements in a
managed network and transmitting or collecting data:
a kind of broadcast ping. The principal advantage of
this kind of approach is that the timing of the distributed
process is event driven. It does not require
elaborate clock synchronization and timed firing to
coordinate the distributed execution, since the interactions
are themselves synchronous. However, it is
inherently fragile, as it involves the privilege to push and
collect through a chain of dependencies. If the top
node loses communications with its children, none of
the network operations will be executed. A better
approach would be to allow all nodes in the network to
operate autonomously and have them cooperate when
they are able.


A typical cfengine approach to the problem is to
execute the distributed agents periodically (with period P,
anything from a few minutes to an hour). Neighbouring
cfagents could download from their child servers
to aggregate the results, but now the timing plays a role.
Since there is no push possibility to coordinate the
operations, the process is fragile to time coordination [9].
There are two issues: i) clock synchronization and ii)
the clock schedule for ensuring that the data are updated
in time before the data values are pulled downstream. If either
of these requirements is not met, data that are pulled
will be out of date and will not give an accurate
representation of the true values.


So what happens if the nodes are not properly 
synchronized? Since cfengine operates autonomously 
and its copying is fault tolerant, a missed update could 
simply be captured at a later time. This might not 
seem like a problem, unless one begins to measure the 
spread of times in the “current” data. The situation is 
somewhat analogous to asking post office branches to 
report to their head office on how many customers 
they have each day using their own postal delivery. At 
any given regular delivery, the letters that arrive at the 
central office have a variety of postmarks. Some of 
them are delivered on the same day, and some of them 
take perhaps a week to deliver. Thus updates might arrive
eventually, but how shall we understand the results
that arrive? Do we group letters by their postmarks
and only combine results that originated
on the same date? Or do we ignore the postmarks and
combine data that were received on the same date? In
the first case, we might have to wait a long time for 
the data, but we are certain of what we are seeing. In 
the latter case, the result is available quickly but the 
meaning of the data is in question. 
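
The two choices in the post office analogy can be made concrete with a small sketch. The report records, field names and numbers below are invented for illustration; only the two grouping strategies come from the text.

```python
from collections import defaultdict

# Hypothetical branch reports: day sent (postmark), day received, customer count.
REPORTS = [
    {"branch": "north", "sent": 1, "received": 1, "count": 40},
    {"branch": "south", "sent": 1, "received": 3, "count": 25},
    {"branch": "east", "sent": 3, "received": 3, "count": 30},
]


def group_by(reports, key):
    """Sum the counts of all reports that share the same value of `key`."""
    totals = defaultdict(int)
    for report in reports:
        totals[report[key]] += report["count"]
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    print("by postmark:", group_by(REPORTS, "sent"))
    print("by arrival: ", group_by(REPORTS, "received"))
```

Grouping by postmark gives a total for day 1 that is only complete once the late letter has arrived on day 3. Grouping by arrival gives a total for day 3 immediately, but that total mixes counts from two different days.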


Each hop in a chain of delivery adds new possi- 
bility for delay. If the mail does not arrive before one 
post office sends its own delivery, the incoming mail 
will have to wait a whole day for the next delivery (a 
whole scheduling period P). A single failure could not 
bring down the entire system, but it could skew the 
impression received at the central monitoring station. 
It is therefore advantageous, if not imperative, to de- 
velop patterns that do not have this strong dependency 
feature. 


To avoid the dependency and delay problem, we 
based our work on the assumption of time synchroniza- 
tion. As we shall see, even this is susceptible to noise. 
Apart from a proof-of-concept implementation, we did 
not pursue the echo pattern for this reason (in spite of 
its ready comprehensibility) and instead were inspired 
by the Generic Aggregation Protocol (GAP) approach. 
For GAP we shall not attempt a complete implementa- 
tion, but rather emulate its operation as a first step to 
making progress. GAP includes an algorithm for auto- 
matic renegotiation of the structure. This has several 
implications which require some soul searching when 
implementing in cfengine. Further research by KTH 
based on the cfengine experience can also help to adapt 
the GAP algorithm for pull-based scenarios. 
Promise Agreements and Voluntary Cooperation


The notion of promises was introduced as a way 
of modeling networks of agents cooperating in an ad 
hoc fashion. Cfengine can be viewed as a reference 
implementation of the abstract promise-theoretic sce- 
nario. Promise theory was introduced precisely as a 
modeling framework that could describe cfengine, 
where others could not. 


Promise theory is a high level graphical description
of constrained behaviour in which ensembles of
agents document the behaviours they promise to exhibit.
Agents in promise theory are truly autonomous,
i.e., they decide their own behaviour, cannot be forced
into behaviour externally but can voluntarily cooperate
with one another [10]. A promise is a directed edge

   $A_i \xrightarrow{b} A_j$   (1)

that consists of a promiser $A_i$ (sender), a promisee $A_j$
(recipient) and a promise body $b$, which describes the
nature of the promise. Promises made by agents fall
into two basic categories: promises to provide something
or offer a behaviour $b$ (written $A_1 \xrightarrow{+b} A_2$), and
promises to accept something or make use of another’s
promise of behaviour $b$ (written $A_2 \xrightarrow{-b} A_1$). A successful
transfer of the promised exchange involves
both of these promises, as an agent can freely decline
to be informed of the other’s behaviour or receive the
service.


The essential assumption of promise theory is 
that all nodes are independent agents, with only pri- 
vate knowledge (e.g., of time). No node can be forced 
to promise anything or behave in any way by an out- 
side agent. Moreover, there are no common standards 
of knowledge (such as knowing the time of day) with- 
out explicit promises being made to yield this infor- 
mation from a source. This viewpoint fits nicely with 
our view of collection of distributed information for 
measurement purposes. 


We shall consider the following promise designations:
+d server provides data, -d client receives/uses
data, +a branch node aggregates data, +t server provides
time/clock, and -t client uses time/clock. Although
we speak mainly of network nodes below, it
will be understood that each node is modeled as an
“agent” in promise theory parlance.


Promise theory allows us to see the relationship 
between network patterns and policy for autonomous 
agents. Each arrow in the promise graph attaches to a 
rule in the policy to either grant access to data or to 
fetch available data. In this way we can build dissemi- 
nation processes over graphs using node location data 
or context sensitivity information. 


A common mistake is to think of promises as
communication transactions, rather than as abstract
behavioural specifiers. A promise says nothing necessarily
about the details of what is communicated between
agents at a given moment, only that the agent intends to
behave within the confines of its promise. However,
one usually assumes that a promise means a best effort
to comply with the announced constraints and that no
promise means that nothing will happen. A reliable
binding between two hosts requires both a promise to
serve and a promise to use the promised service:


   $A_1 \xrightarrow{+b} A_2, \quad A_2 \xrightarrow{-b} A_1$   (2)
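
The requirement that a binding needs both halves can be sketched as a check over a set of promises. The tuple representation (promiser, promisee, body) and the function name bindings are assumptions made for this example; the designations +d and -d follow the list given earlier in this section.

```python
def bindings(promises):
    """Return the links for which both a give and a use promise exist.

    `promises` is a set of (promiser, promisee, body) tuples, where the
    body is a designation such as "+d" (provide data) or "-d" (use data).
    """
    links = set()
    for promiser, promisee, body in promises:
        if body.startswith("+"):
            use_promise = (promisee, promiser, "-" + body[1:])
            if use_promise in promises:
                links.add((promiser, promisee, body[1:]))
    return links


if __name__ == "__main__":
    promises = {
        ("A1", "A2", "+d"),  # A1 promises to serve data to A2
        ("A2", "A1", "-d"),  # A2 promises to use data from A1
        ("A1", "A3", "+d"),  # offered, but A3 has not promised to use it
    }
    print(bindings(promises))
```

Only the pair A1 and A2 forms a link here. The offer to A3 has no matching use-promise, so under the assumption above nothing is expected to flow along it.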


The Echo and GAP patterns are particularly well 
suited to implementation using voluntary cooperation, 
because the propagation of data along tree-like path- 
ways does not depend strongly on whether data are 
pushed or pulled. The main challenge in a voluntary 
cooperation scenario is for an agent in the graph to 
know when its child has data waiting. When data are 
pushed, we essentially send a signal “do it now”, and 
no other time synchronization is required. This be- 
comes more complicated in a pull regime however. 
Regular polling of a host’s servers is an obvious an- 
swer to the question of when to download data. If 
clocks in the network are synchronized correctly we 
can even ask for data to be copied only if they have 
been updated since the last copy. However, this re- 
quires the extra overhead of time synchronization and 
it still does not guarantee that data will be ready for 
collection at a given moment. 


This issue becomes most pronounced when one 
attempts to request regular pollings of data and the 
time for data to propagate through the network ap- 
proaches the time interval for the polling. We have 
discussed this issue in a separate paper [9], but some 
of the effects can be seen in Figures 5 and 6. 


Using Context Awareness for Making Network Pat- 
terns 


Cfengine agents are aware of location and con- 
text through their evaluation of the environment into a 
set of classes. These classes are then used as Boolean 
flags to attach policies conditionally to scenarios. This 
context sensitivity enables a set of distributed promis- 
es to be coded into a single document. 


A method in cfengine is like a pair of promises,
provided it is voluntarily declared by both parties. An
MD5 hash is used to verify that the methods are in fact
the same.


The first (service) promise identifies the function
being performed, as the body b(). The class expression
A_1:: says that this rule applies to the context of agent
A_1, which is the service provider (server host). The
server=A_1 attribute matches the context expression
and, from this, the agent deduces that it is the provider.


   methods:
     A_1::
       b(params) server=A_1

   $A_1 \xrightarrow{+b} A_2, \quad A_2 \xrightarrow{-b} A_1$   (3)

The second part applies to agent A_2 and has the
form:

   methods:
     A_2::
       b(params) server=A_1


This identifies the function being performed and
signals to A_2 that it will use the results produced by
server A_1. Since this is not its own identity, this implies
that the result is a use-promise.


If we assume that two agents use an identical
configuration specification, then a remote procedure
call binding can be written:

   methods:
     A_1|A_2::
       b(params) server=A_1


The same text is evaluated in both contexts, and a single
link in a logical overlay network is added.
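
How one shared rule yields two different promises can be sketched as follows. This is a toy model of the reasoning in the text and not cfengine's class evaluator: only the "|" operator is handled (real class expressions have more operators), and the function names applies and role are hypothetical.

```python
def applies(class_expr, my_classes):
    """True if any alternative of an 'a|b' class expression is defined locally."""
    return any(name in my_classes for name in class_expr.split("|"))


def role(class_expr, server, me):
    """Decide what the shared rule means on host `me`.

    Returns "provide" if the rule applies and `me` is the named server,
    "use" if the rule applies to another host, and None otherwise.
    """
    if not applies(class_expr, {me}):
        return None
    return "provide" if me == server else "use"


if __name__ == "__main__":
    for host in ("A_1", "A_2", "A_3"):
        print(host, role("A_1|A_2", "A_1", host))
```

The same rule text is read by all three hosts: A_1 concludes that it is the provider, A_2 that it is a user, and A_3 that the rule does not concern it.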


Echo 


Cfengine’s modus operandi is to “pull” data 
rather than to push. This is a natural side effect of its 
philosophy of voluntary cooperation. Push is disal- 
lowed, with one exception: we are allowed to send a 
single invitation to each peer to execute its existing 
policy using the command cfrun. The host is free to 
disregard this message, but for cooperation purposes it 
is normal for the peer to respond to such an invitation 
by executing its policy compliance-checking agent. 
We can use this mechanism to start an echo avalanche, 
with a pre-arranged pattern. 


The start host executes cfrun to a number of “chil- 
dren”. Each child then voluntarily executes cfengine, 
which in turn encapsulates the execution of cfrun di- 
rected at another set of children, which encapsulates 
cfrun to another set, and so on. Since cfengine aggre- 
gates the data from encapsulated processes automati- 
cally, it automatically aggregates the entire tree in a 
synchronized manner. This is the simplest implemen- 
tation of echo which uses context sensitivity to identi- 
fy parent-child relationships. 


Both serial and parallel star collation can be per- 
formed in cfengine echo. The difference is that the 
parallel star cfagent.conf issues an individual cfrun com- 
mand to each client in the background. Additionally, 
the output from each of those commands is redirected 
to a file. When all the cfrun processes have finished, 
the output files are concatenated together and printed 
to the terminal so that the parallel and serial star tests 
both provide nearly identical terminal output. Howev- 
er, it should be noted that the parallel star approach, 
involving the use of a separate temporary file for each 
client, involves a great deal more file input and output 
operations than serial star. 


The echo cfagent.conf draws from the same frame- 
work used for the parallel star, e.g., executing cfrun com- 
mands in the background with output redirected to 
files. In this case, a variable is defined for each host 
that has children. The variable contains a list of the 
node’s children in the tree. If this variable is defined, 
cfrun is called for each child node. Therefore the tree is 
statically defined. 
The use-promises are encoded as shown in
Figure 3.





control:
   actionsequence = ( shellcommands tidy )  # value lost in extraction; inferred from the sections below
   domain         = ( cftestnet )
   IfElapsed      = ( 1 )
   TrustKeysFrom  = ( 10.0.0 )

 node1::
   serve = ( node2:node3:node4 )
 node2::
   serve = ( node5:node6:node7 )
 node3::
   serve = ( node8:node9:node10 )
 node4::
   serve = ( node11:node12:node13 )
 node5::
   serve = ( node14:node15:node16 )
 node8::
   serve = ( node17:node18:node19 )
 node11::
   serve = ( node20 )

classes:
   HasChildren = ( IsDefined(serve) )

shellcommands:
   "/bin/echo $(hostname)"

 HasChildren::
   "/usr/local/sbin/cfrun $(serve) \
      2>&1 > /tmp/echorun.$$"
      background=true # parallelize

   "/usr/bin/pgrep cfrun > /dev/null; \
    while [ $? = 0 ]; \
    do pgrep cfrun > /dev/null; done"

   "/bin/cat /tmp/echorun.*"

tidy:
 HasChildren::
   /tmp pattern=echorun.* age=0

Figure 3: The only kind of push structure that can be
implemented in cfengine is the echo pattern, using
nested cfrun commands. These must be authorized
in advance.
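
To see the shape of the statically defined tree in Figure 3, the serve lists can be written as a dictionary and walked level by level. The sketch below only models which hosts are invited in each wave; it does not run cfrun, and the names SERVE and waves are introduced here for illustration.

```python
# The parent -> children lists, as given by the serve variables in Figure 3.
SERVE = {
    "node1": ["node2", "node3", "node4"],
    "node2": ["node5", "node6", "node7"],
    "node3": ["node8", "node9", "node10"],
    "node4": ["node11", "node12", "node13"],
    "node5": ["node14", "node15", "node16"],
    "node8": ["node17", "node18", "node19"],
    "node11": ["node20"],
}


def waves(root, serve=SERVE):
    """Return the hosts reached at each depth of the expansion phase."""
    level, result = [root], []
    while level:
        result.append(level)
        level = [child for node in level for child in serve.get(node, [])]
    return result


if __name__ == "__main__":
    for depth, hosts in enumerate(waves("node1")):
        print(depth, hosts)
```

With these lists the expansion from node1 reaches all 20 hosts in three hops, and the hosts without a serve list are the leaves at which contraction starts.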





Promise Chains (Forwarding) 


Two implementations of chains are shown in 
Figures 7 and 8 for readers to try. Conventional wis- 
dom suggests that tree depth corresponds directly to 
latency in terms of end-to-end communication; chains 
contain the maximum number of non-repeated hops in 
a topology and therefore the highest latency on mes- 
sages passing from one end of the chain to the other. 
Chains are highly susceptible to failure due to the fact 
that any individual link or node failure can disrupt 
end-to-end communications; the closer the failure is to 
the root, the more substantial the loss. This is a basic 
problem with all structures of significant depth. 


Using a chain length of 20 nodes, we considered the
periodic execution of cfagent each minute and measured
the time to propagate data from one end of the chain to
the other, in repeated trials. The result of the completed
aggregation for this test is a file on the root node containing
each node’s CPU load average as well as the time at
which that information was collected. Each node used
the cfengine copy action to copy a partially aggregated
file from its child. Then the node used the cfengine editfiles
action to append its own load data to the bottom of
the file.
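
The copy-then-append step can be mimicked in a few lines. This is an illustrative model only: a list of strings stands in for the aggregated file, the record layout "host load timestamp" is an assumption, and the function names are hypothetical.

```python
def aggregate_chain(samples):
    """Build the root's file from samples ordered leaf first, root last.

    Each sample is (host, load, timestamp). Every node copies the partial
    file of its child and appends one line of its own.
    """
    lines = []
    for host, load, stamp in samples:
        copied = list(lines)                          # stands in for the copy action
        copied.append(f"{host} {load:.2f} {stamp}")   # stands in for the append
        lines = copied
    return lines


def timestamp_spread(samples):
    """Difference between the newest and the oldest timestamp in the file."""
    stamps = [stamp for _, _, stamp in samples]
    return max(stamps) - min(stamps)


if __name__ == "__main__":
    samples = [("node3", 0.5, 100), ("node2", 0.25, 160), ("node1", 1.0, 220)]
    print("\n".join(aggregate_chain(samples)))
    print("spread:", timestamp_spread(samples))
```

The spread of timestamps inside the final file is a direct measure of how far apart in time the individual readings were taken.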


The results of the experiments are shown in Fig- 
ure 4. We shall report on a detailed explanation else- 
where. 


The graphs can be understood roughly as fol- 
lows. The solid line shows a prediction based on the 
assumption of regular deterministic behaviour. For ze- 
ro time-delay between receiving and sending in the 
chain the age of the data is about ten periods. This is 
what one would expect by random chance: about half 
the nodes are correctly ordered on average. As the de- 
lay is increased to one minute (greater than noise) the 
noise becomes irrelevant and an optimal number of 
nodes is correctly ordered for direct transmission. This 
gives the fastest result. Then as the delay increases, 
the time increases in steps. If the wait time times the 
length of the chain is greater than a period, then the 
nodes on the period boundary will be out of step and 
will have to wait a whole period to update, hence the 
jumps in the graph. What is interesting is that the ef- 
fect of noise is to improve this handicap. There is no 
room here for a full discussion of this phenomenon, 
but the result is essential to understand for monitoring. 
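
The estimate of about ten periods for zero delay can be reproduced with a deliberately crude model, which is our own simplification and not the analysis behind Figure 4. Assume each node runs exactly once per period at a fixed random phase and that copying takes no time. Data then move up one hop within the same period only if the parent runs later in the period than the child; otherwise the hop costs a whole period.

```python
import random


def periods_lost(phases):
    """Count hops that must wait for the next period.

    phases[0] belongs to the leaf and phases[-1] to the root. A hop is
    delayed when the parent runs no later in the period than its child.
    """
    return sum(1 for child, parent in zip(phases, phases[1:]) if parent <= child)


def mean_periods_lost(n_nodes, trials=2000, seed=0):
    """Average delay, in periods, over random phase assignments."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += periods_lost([rng.random() for _ in range(n_nodes)])
    return total / trials


if __name__ == "__main__":
    print(mean_periods_lost(20))
```

Under these assumptions each of the 19 hops is out of order with probability one half, so the expected delay for a chain of 20 nodes is 9.5 periods, in line with the figure of about ten quoted above. Perfectly ordered phases give no delay at all, and perfectly reversed phases give the worst case of 19 periods.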


Promise Trees (Aggregation) 


The chain is an unlikely topology in a real distributed
system. In most cases one would expect a
node to be able to connect to several other nodes and
allow a greater centralization of data during aggregation.
We have repeated our experiments for binary
trees and the results are shown in Figure 5. The
cfengine configuration patterns for these tests are
shown in the aggregation examples, Figures 8 and 9.


The data from the tree results are not directly 
comparable to those of the chain, for several reasons. A 
number of scales change when performing local aggre- 
gation and these changes interfere with the time-scales 
of system noise. Understanding the tree results is there- 
fore rather more complicated than understanding the 
chains. The parallel arms of the trees interfere some- 
times destructively for parallelized copy and sometimes 
constructively in serialized copy. Thus our graph seems 
to reveal a relative stability compared to the chain. This 
is slightly misleading however. The same basic behav- 
iour is common to both cases; however, the tree is able 
to delay the onset of temporal instability from chain 
depth (see [9] for more explanation). 


Suppose we put aside the restrictions on topology
due to local environment, e.g., the finite range of a
wireless network, and ask whether there are reasons
for building a tree with a particular number of neighbours
(node degree) for aggregation or dissemination.
This question should be answered differently depending
on who initiates a transmission through the network,
how often and at what relative times. In the
cfengine model of maintenance in which data are
sampled at regular intervals, the behaviour of an aggregation
process is something of a cross between the
GAP protocol and a Gossip approach [11]. The periodic
checking of cfengine promises adds a level of
complexity to the data quality of the final result. However,
the synchronization of the binary tree is much
less sensitive to the size of small offsets than that of the
chain, so it would seem to be advantageous to choose a
tree over a chain.


Figure 5: Experimental results for binary tree propagation.


21st Large Installation System Administration Conference (LISA ’07)


Burgess, Disney, & Stadler


Clearly then the tree is more efficient in terms of 
time and the decreased network depth gives more free- 
dom in choosing the synchronization parameters. In- 
creasing the node degree (number of children) in the 
tree increases the processing burden on the aggregator 
in order to maintain the same accuracy of service level 
however. A question therefore presents itself: is there 
an optimal node degree for distributed monitoring? 


Scalability 


Scalability is about how well a system continues 
to perform in all its parts as it grows. The burden of 
size can have a variety of negative effects on a system. 


For scalability, we seek to minimize the time to
delivery from the leaves of the data structures to the
roots (i.e., obtain the lowest value on the vertical axis),
while maintaining meaningful data by minimizing
temporal fragmentation (partially represented by the
error bars). Thus we would like to be as close to the lower
left of the figures as possible. Our results tell us
something about how to achieve this by adjusting
overlay topology.


The two structural poles for the network patterns 
were illustrated in Figure 1: the star pattern for maxi- 
mum parallelization and centralization (hence maxi- 
mum burden per root node) and the chain for maxi- 
mum off-loading and decentralization (hence maxi- 
mum temporal fragmentation). With centralization, the 
fraction of the central node’s capacity that is available 
to its children decreases in proportion to the number of 
clients, so since the capacity is fixed scaling means a 
reduction of workflow on the children proportional to 
their own number [1]. In a chain, every node can use 
its maximum processing capacity on its neighbour and 
that chain can grow as long as we like until the load of 
data aggregation (which grows in proportion to its 
length) becomes a significant burden. 


There are thus advantages and disadvantages to 
each of these structures, with regard to both organiza- 
tion and processing capacity. A tree is essentially a 
compromise between the two: any tree can be seen as 
a number of stars chained together. We must decide as 
a matter of policy what node degree or number of 
branches these stars should have in order to compro- 
mise on these two dipolar effects of growth. 

One interesting example in which the topology of
a network pattern could have a direct effect on scalability
is power consumption. Since we envisage network
patterns finding application in mobile ad hoc networks
which run off batteries, e.g., sensor networks with limited
resources, we should think about the possibility
that the choices we make will affect the lifetime of the
devices. Power consumption too might have to be
traded against speed and accuracy.


We have no generic answer to the question of
which kind of structure is best in a given case, as such
concerns are a matter for policy. However, consider the
following. The rate of power consumption of a node is
proportional to its CPU frequency squared [12]. Thus if
we design at maximum utilization to cope with demand
from aggregation of $k$ neighbours, we must scale cost as
$k^2$, which represents power, cost of cooling or shortened
battery life, etc. The risk, on the other hand, associated
with not getting data quickly is proportional to the effective
depth of the network pattern, $(N-1)/k$. So we have
a cost function that is a balance between these two:


   $\mathrm{Cost} = \alpha k^2 + \frac{N-1}{k}$   (4)
A plot of this for the arbitrary policy $\alpha = 0.1$ is
shown in Figure 6. This shows the existence of an optimum
aggregation degree, in this case k = 5. If k were a constant
all over the network, i.e., the network formed a
regular graph, this would be the optimal answer for
minimizing power consumption. However, there are
many constraints in ad hoc networks that would make
it unlikely that such a regular tree could be maintained;
moreover, there are other concerns than power consumption.
In general one must compromise between a
number of different optimization parameters competing
for attention. More detailed considerations then
need to be applied to the problem. As we see, the cost
rises sharply with increasing centralization. However,
this does not help roaming hand-held devices with
limited range, which both cannot centralize and do not
want the computational burden focused in one place.
Figure 6: Cost considerations can plausibly lead to an
optimum depth of network pattern when power
considerations are taken into account. The minimum
cost here is given for k = 5. Such considerations
require an arbitrary choice to be made about the
relative importance of factors.
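
The trade-off can be explored numerically. The sketch below assumes the cost takes the form described in the text, a power term growing as k squared weighted by the policy parameter (0.1 in the text) plus a depth term (N-1)/k. The number of nodes N is not stated in the text; N = 26 is an assumption chosen here because it places the minimum at k = 5. The function names are hypothetical.

```python
def cost(k, n_nodes, alpha):
    """Power term (alpha * k^2) plus risk term (effective depth (N-1)/k)."""
    return alpha * k ** 2 + (n_nodes - 1) / k


def best_degree(n_nodes, alpha):
    """Integer aggregation degree k with the lowest cost, for 1 <= k <= N-1."""
    return min(range(1, n_nodes), key=lambda k: cost(k, n_nodes, alpha))


if __name__ == "__main__":
    n_nodes, alpha = 26, 0.1  # N = 26 is an assumed value, see the text above
    for k in range(1, 9):
        print(k, round(cost(k, n_nodes, alpha), 2))
    print("optimum k:", best_degree(n_nodes, alpha))
```

Setting the derivative of this cost to zero gives k cubed equal to (N-1) divided by twice the policy parameter, so a larger network or a smaller weight on power both push the optimum towards wider, shallower trees.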
Our work here does not offer a simple answer to
this conundrum, but shows network managers how to
investigate and locate their own compromise as a matter
of policy.


Conclusions 


In the present work we have provided a proof of 
concept for implementing network data aggregation 
and dissemination patterns at the host level, using 
promise theory inspired methods. We have shown that 
we can avoid scalability bottlenecks only at the ex- 
pense of temporal fragmentation of data. If users make 
logical star networks, they will have the greatest level 
of certainty about their data but the most fragile archi- 
tecture in the face of growth. If they choose a number 
of star topologies chained together they can make a 
suitable compromise. Most importantly, we point out 
that the uncertainties incurred should be monitored 
and presented as part of the data’s time-stamps. 


We feel that our hybrid network/system study is a
stepping stone towards integrating host and network
administration within a common framework. Our work
has been based on KTH’s distributed protocols, and our
investigation must be seen as tentative. We have not
implemented all the features of the GAP protocol here.
The adaptive creation of a network overlay is a topic
for a later time; nevertheless, some experimental peer to
peer features of cfengine are already similar to the ideas
used in GAP, and we intend to explore these further.
Some partial approximations for this are implemented
as the SelectPeerNeighbours and SelectPeerLeader functions in
cfengine, with failover options. However, the full details
of the algorithm still have to be understood.
Completing this will probably take another six months
to a year. Tests are proceeding and will
drive a discussion as to the most appropriate way of
deciding a topology in a cfengine peer network.


Our microscopic investigation of propagation un- 
certainty in [9] shows that distributed structures lead 
to uncertain results. The uncertainties measured in a 
cfengine network are not simply related to errors in 
aggregation due to unreliable nodes, as studied in
[5, 11], so it is not clear whether the generalization
A-GAP would be a realistic solution to the problem here.


The syntax of cfengine’s voluntary cooperation
model is based on peer-to-peer interactions, just like
promise theory. It was designed with simple one-to-one
contracts in mind; we did not consider the possibility
of widespread interconnection of contractual
relationships. This results in clumsy and cumbersome
policy files for encoding patterns in cfengine. In
further work we expect to allow regular expressions of
some form, so that the bilateral promises required for
pattern policies can be encoded more efficiently.
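
Until such a feature exists, the repetition can be handled outside cfengine. The Python sketch below is a hypothetical external generator, not the regular-expression mechanism anticipated above: it renders one methods stanza per adjacent pair of hosts, in the style of the chain example in Figure 7. The function name and its parameters are our own illustrative choices.

```python
from typing import List


def chain_method_stanzas(hosts: List[str],
                         method: str = "Aggregate",
                         action: str = "method_pattern.cf") -> str:
    """Render a methods section for a chain of hosts listed root first.

    Each adjacent pair (client, server) becomes one bilateral stanza,
    mirroring the hand-written pattern of the chain example.
    """
    if len(hosts) < 2:
        raise ValueError("a chain needs at least two hosts")
    stanzas = []
    for client, server in zip(hosts, hosts[1:]):
        stanzas.append("\n".join([
            f" {client}|{server}::",
            f'   {method}("$(workfile)")',
            f"      server={server}",
            f"      action={action}",
            "      returnvars=ret",
            "      returnclasses=chain_link",
        ]))
    return "methods:\n\n" + "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    print(chain_method_stanzas(["node1", "node2", "node3", "node4"]))
```

A chain of N hosts yields N-1 stanzas, which is exactly the part of the policy that becomes cumbersome to maintain by hand as the pattern grows.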


As we write this, the team at Stockholm has de- 
veloped a new pattern which they refer to as MGAP, 
in which every node in a structure can receive a copy 
of the total aggregate. It seems likely that this pattern 
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will find a special place in cfengine for extending 
cfengine’s peer to peer monitoring capabilities. We
look forward to reporting on this in future work.


This work is supported by the EC IST-EMANICS
Network of Excellence (#26854).
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Appendix: Examples 
##################################################
#
## CHAIN 4 machines 1,2,3,4 (promise chain)
#
##################################################

classes:

   always = ( any )

   leaf = ( node4 )
   root = ( node1 )

##################################################

control:

   workfile = ( "/tmp/chain-pattern" )

##################################################

methods:

 #
 # Pattern has to be coded in classes (from)
 ## and servers (to)
 #

 node1|node2::                  # -b | +b - binding

   Aggregate("$(workfile)")

      server=node2
      action=method_pattern.cf
      returnvars=ret
      returnclasses=chain_link

 node2|node3::

   Aggregate("$(workfile)")

      server=node3
      action=method_pattern.cf
      returnvars=ret
      returnclasses=chain_link

 node3|node4::

   Aggregate("$(workfile)")

      server=node4
      action=method_pattern.cf
      returnvars=ret
      returnclasses=chain_link

##################################################

editfiles:

 !leaf::

   { $(workfile)

   AutoCreate
   EmptyEntireFilePlease
   AppendIfNoSuchLine "$(Aggregate.ret)"

   ## Handle errors so no strange loops
   ReplaceAll "Aggregate.ret" With "FAILED"
   }

 leaf::

   { $(workfile)

   AutoCreate
   EmptyEntireFilePlease
   AppendIfNoSuchLine "$(value_loadavg)"
   }

##################################################

alerts:

 root.Aggregate_chain_link::

   "Chain aggregate $(n)$(host)=$(value_loadavg)
    at $(date) $(Aggregate.ret) "


Figure 7: A promise chain fully represented as a contract between parties by voluntary cooperation. 
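
The data flow that this chain policy is meant to produce can be mimicked in a few lines of Python. This is only a loose model under our own assumptions, not a description of cfengine's method semantics: the function name is hypothetical, an unreachable host is represented by a value of None, and its upstream neighbour is assumed to record the string FAILED in place of everything further down the chain.

```python
from typing import Dict, List, Optional

FAILED = "FAILED"


def aggregate_chain(chain: List[str],
                    loadavg: Dict[str, Optional[float]]) -> str:
    """Return the text that would reach the root of a promise chain.

    `chain` lists hosts root first. A load value of None marks a host
    that does not answer, so data from that host and below is lost.
    """
    if not chain:
        raise ValueError("chain must contain at least one host")
    result: Optional[str] = None
    # Walk from the leaf towards the root, as the data does.
    for host in reversed(chain):
        value = loadavg.get(host)
        if value is None:
            # The upstream neighbour sees no return value from this host.
            result = FAILED
            continue
        own = f"{host}={value:.2f}"
        result = own if result is None else f"{own} {result}"
    return result


if __name__ == "__main__":
    hosts = ["node1", "node2", "node3", "node4"]
    loads = {"node1": 0.5, "node2": 1.0, "node3": None, "node4": 2.0}
    print(aggregate_chain(hosts, loads))
```

The sketch makes the fragility of the chain visible: a single silent link hides every host behind it, which is why the policy replaces an unexpanded return variable with an explicit failure marker.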
##################################################
# Netlab config
##################################################

classes:

   leaf = ( netlab4 )
   root = ( netlab1 )

##################################################

control:

   workfile = ( "/tmp/chain-pattern" )
   tempfile = ( "/tmp/chain-temp" )

 netlab1::
   serve = ( netlab3 )

 netlab3::
   serve = ( netlab4 )

##################################################

tidy:

##################################################

copy:

 !leaf::

   $(workfile)

      dest=$(tempfile)
      server=$(serve)
      type=checksum
      define=success
      elsedefine=failure

##################################################

editfiles:

 success::

   { $(workfile)

   AutoCreate
   EmptyEntireFilePlease
   InsertFile "$(tempfile)"
   AppendIfNoSuchLine "copy-chain $(host)=$(value_loadavg) at $(date)"
   }

 failure::

   { $(workfile)

   AutoCreate
   EmptyEntireFilePlease
   AppendIfNoSuchLine "copy-chain - no response from $(serve)"
   AppendIfNoSuchLine "copy-chain $(host)=$(value_loadavg) at $(date)"
   }

 leaf::

   { $(workfile)

   AutoCreate
   EmptyEntireFilePlease
   AppendIfNoSuchLine "copy-chain $(host)=$(value_loadavg) at $(date)"
   }

##################################################

alerts:

 success::

   "Chain update succeeded"
   PrintFile("$(workfile)","6")

 failure::

   "No Chain update at $(date)"


Figure 8: A simplified version of the promise chain built using a simple pull method. This is much more trusting
than the previous example and assumes a certain control over the children.
##################################################
## Depth aggregation (promise tree)
#
##################################################

classes:

   leaf = ( netlab3 netlab4 )
   aggregator = ( netlab1 )

##################################################

control:

   workfile = ( "/tmp/chain-pattern" )

   children = (
              A(netlab1,"netlab3,netlab4")
              A(netlab3,"netlab3,netlab4")
              A(netlab4,"netlab3,netlab4")
              )

##################################################

methods:

 netlab1|netlab3|netlab4::      # 2 servers, 1 client

   Aggregate("$(workfile)")

      server=$(children[$(host)])
      action=method_pattern.cf
      returnvars=ret
      returnclasses=chain_link

##################################################

editfiles:

 aggregator::

   { $(workfile)
   AutoCreate
   EmptyEntireFilePlease
   AppendIfNoSuchLine "$(Aggregate_1.ret)"
   AppendIfNoSuchLine "$(Aggregate_2.ret)"
   ## Handle errors so no strange loops
   ReplaceAll "Aggregate.*ret" With "FAILED"
   }

 leaf::

   { $(workfile)
   AutoCreate
   EmptyEntireFilePlease
   AppendIfNoSuchLine "$(average_loadavg)"
   }

##################################################

alerts:

 aggregator.(Aggregate_1_chain_link|Aggregate_2_chain_link)::

   "Chain aggregate $(n)$(host)=$(average_loadavg) at $(date) \
    $(Aggregate_1.ret) $(Aggregate_2.ret) "


Figure 9: A two to one aggregation of text data. This example uses a full promise approach. 
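
The same toy model extends from a chain to a tree. The Python sketch below is again a hypothetical illustration rather than cfengine behaviour: an aggregator concatenates its own value with the text returned by each of its children, and a child that does not answer (value None) contributes the string FAILED.

```python
from typing import Dict, List, Optional

FAILED = "FAILED"


def aggregate_tree(host: str,
                   children: Dict[str, List[str]],
                   loadavg: Dict[str, Optional[float]]) -> str:
    """Return the aggregated text seen at `host` in a promise tree."""
    value = loadavg.get(host)
    if value is None:
        # This host does not answer, so its whole subtree is lost.
        return FAILED
    parts = [f"{host}={value:.2f}"]
    for child in children.get(host, []):
        parts.append(aggregate_tree(child, children, loadavg))
    return " ".join(parts)


if __name__ == "__main__":
    tree = {"netlab1": ["netlab3", "netlab4"]}
    loads = {"netlab1": 0.1, "netlab3": 0.2, "netlab4": None}
    print(aggregate_tree("netlab1", tree, loads))
```

Unlike the chain, a failed child here costs only its own branch, since the aggregator still receives the other return value.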
##################################################
# Breadth aggregation by pull
##################################################

classes:

   leaf = ( netlab4 netlab3 )
   root = ( netlab1 )

##################################################

control:

   Split = ( , )
   workfile = ( "/tmp/chain-pattern" )
   tempfile = ( "/tmp/chain-temp" )

 #                               1
 ## One link in a binary tree   / \   aggregation
 #                             3   4

 netlab1::

   serve = ( "netlab3,netlab4" )

##################################################

copy:

 !leaf::

   $(workfile)

      dest=$(tempfile) $(this)
      server=$(serve)
      type=checksum
      define=success
      elsedefine=failure

##################################################

editfiles:

 success::

   { $(workfile)

   AutoCreate
   EmptyEntireFilePlease
   InsertFile "$(tempfile) $(serve)"
   AppendIfNoSuchLine "copy-chain $(host)=$(value_loadavg) at $(date)"
   }

 failure::

   { $(workfile)

   AutoCreate
   EmptyEntireFilePlease
   AppendIfNoSuchLine "copy-chain - no response from $(serve)"
   AppendIfNoSuchLine "copy-chain $(host)=$(value_loadavg) at $(date)"
   }

 leaf::

   { $(workfile)

   AutoCreate
   EmptyEntireFilePlease
   AppendIfNoSuchLine "copy-chain $(host)=$(value_loadavg) at $(date)"
   }

##################################################

alerts:

 success::

   "Chain update succeeded"
   PrintFile("$(workfile)","6")

 failure::

   "No Chain update at $(date)"


Figure 10: A simpler pull version of the aggregation example. 
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